Spider Help

About Spider

Spider is a Firefox extension which can recursively visit the pages of a web site. It is customizable to perform arbitrary processing on the web content it visits and is especially useful in performing quality assurance tasks on web content. Spider is supported only by Gecko-based browsers.

To learn more about Spider, visit its home page or subscribe to http://bclary.com/home.rss where announcements regarding new versions are made.

This software is a modified version of the CSpider application hosted at http://devedge-temp.mozilla.org/toolbox/examples/2003/CSpider/.

Developers can learn more about the internal operation of the CSpider JavaScript Class, which provides the basic framework for this application, from the CSpider JavaScript Class manpage.

Special Considerations

Security Privileges

Spider is capable of capturing error messages and spidering sites from other domains through the use of extended security privileges.

Preferences

Useful Preferences

javascript.options.strict enables Firefox to report on JavaScript strict warnings which are very useful in determining potential problem areas in scripts. javascript.options.showInConsole can be used to report internal browser errors to the JavaScript console and is useful for Firefox developers.

user_pref("javascript.options.strict", true);
user_pref("javascript.options.showInConsole", true);

If you need to spider sites which require authentication, the following preferences can be used to automatically negotiate authentication credentials for specific sites and domains.

 // negotiate (SPNEGO) authentication
 user_pref("network.negotiate-auth.trusted-uris", site-list);
 user_pref("network.negotiate-auth.delegation-uris", site-list);

 // ntlm authentication
 user_pref("network.automatic-ntlm-auth.allow-proxies", true);
 user_pref("network.automatic-ntlm-auth.trusted-uris", site-list);

 // Confirm user intent whenever URL of the form 
 // http://user:pass@my.site.com is accessed?
 user_pref("network.http.defensive-auth-prompting", false);

where site-list is a string containing a comma-delimited list of protocols, sites or domains. Be careful not to unintentionally broaden the scope of sites where authentication negotiation is used, since doing so might expose your credentials to attackers.

In order for automatic negotiation to work, you will need to be logged into your local machine using the same credentials that are used on the remote site. You can find more information about these preferences in Nigel McFarlane's Firefox Hacks (online).
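
As an illustration of the site-list format, the following sketch joins a few entries into a comma-delimited string and prints one user.js line. The helper names buildSiteList and prefLine and the example hosts are hypothetical; they are not part of Spider or Firefox.

```javascript
// Build the comma-delimited site-list string from an array of entries.
// Blank entries and surrounding whitespace are dropped.
function buildSiteList(entries) {
  return entries
    .map(function (entry) { return entry.trim(); })
    .filter(function (entry) { return entry.length > 0; })
    .join(",");
}

// Format one user.js line; JSON.stringify supplies the double quotes.
function prefLine(name, value) {
  return "user_pref(" + JSON.stringify(name) + ", " + JSON.stringify(value) + ");";
}

// Hypothetical hosts: one specific site and one whole domain.
var siteList = buildSiteList(["https://intranet.example.com", "example.org"]);
console.log(prefLine("network.negotiate-auth.trusted-uris", siteList));
```

Keeping the list as short and as specific as possible limits the sites to which credentials are offered.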

Starting Spider

Spider is run as a standalone instance of Firefox. Spider sends logging output to both the JavaScript console and to STDOUT if the dump function is enabled. Since the JavaScript console is limited in the number of messages it may contain, it is important to also send the logging messages to STDOUT. This allows the results of a Spider run to be collected in a text file for later review and processing. In order to see this output you should start Firefox from the command line and redirect STDOUT to a text file.

The following command will start Firefox, automatically select the profile test, load Spider and direct all output to the file test.log.

      firefox -P test -spider > test.log 2>&1
      

Starting Spider from the command line with arguments

-spider              Start Spider (required)
-url <url>           Spider site at <url>
-uri <uri>           Spider site at <uri>
-domain <domain>     Restrict Spider to urls matching <domain>
-depth <depth>       Spider to depth of <depth>
-timeout <timeout>   Time out Spider if page takes more than <timeout>
                     seconds
-wait <wait>         Pause Spider for <wait> seconds after each page
-hook <hookscript>   Execute Spider <hookscript>
-start               Automatically start Spider
-quit                Automatically quit when finished
-robot               Obey robots.txt
-fileurls            Allow file:// urls
-debug               Debug Spider
-jserrors            Display JavaScript errors
-jswarnings          Display JavaScript warnings
-chromeerrors        Display chrome errors
-xblerrors           Display XBL errors
-csserrors           Display CSS errors
-httprequests        Display HTTP requests
-invisible           Hide loaded page

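For example, the following command selects the profile test, spiders http://bclary.com/ to a depth of 2 while restricted to bclary.com, and starts automatically: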

firefox -P test -spider -url http://bclary.com/ -domain bclary.com -depth 2 -start
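
When Spider runs are launched from a script, the argument list can be assembled programmatically. The sketch below uses only the flags listed above; buildSpiderArgs and its options object are hypothetical helpers and are not part of Spider.

```javascript
// Assemble Spider command line arguments from an options object.
// Only flags documented in the table above are emitted.
function buildSpiderArgs(options) {
  var args = ["-spider"]; // -spider is required

  // Flags that take a value.
  var valueFlags = ["url", "domain", "depth", "timeout", "wait", "hook"];
  valueFlags.forEach(function (name) {
    if (options[name] !== undefined) {
      args.push("-" + name, String(options[name]));
    }
  });

  // Flags that are simple switches.
  var switchFlags = ["start", "quit", "robot", "fileurls", "debug",
                     "jserrors", "jswarnings", "chromeerrors",
                     "xblerrors", "csserrors", "httprequests", "invisible"];
  switchFlags.forEach(function (name) {
    if (options[name]) {
      args.push("-" + name);
    }
  });

  return args;
}

var args = buildSpiderArgs({
  url: "http://bclary.com/",
  domain: "bclary.com",
  depth: 2,
  start: true
});
console.log(["firefox", "-P", "test"].concat(args).join(" "));
```

The resulting array could also be handed to a process launcher, for example as the argument list of Node's child_process.spawn.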

Data

URL

URL is the location which you wish to spider. This can be a fully qualified URL such as http://www.example.com/ or a partial domain such as example.com.

Domain

Domain is used to restrict Spider to following links on a specific domain. If this value is not specified, it will be generated from the URL by removing the protocol and any leading www.

Domain is useful when you wish to start at a given URL but do not wish to restrict Spider to URLs which contain the initial URL.
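
The default described above can be sketched as follows. This is not Spider's actual code: deriveDomain is a hypothetical name, and dropping everything after the host name is an extra assumption made for this illustration.

```javascript
// Derive a default Domain value from a URL.
function deriveDomain(url) {
  return url
    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, "") // remove the protocol, e.g. http://
    .replace(/^www\./i, "")                  // remove any leading www.
    .replace(/\/.*$/, "");                   // assumption: remove the path
}

console.log(deriveDomain("http://www.example.com/"));
console.log(deriveDomain("http://foo.example.com/index.html"));
```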

Depth

Depth is the number of link levels to follow from the initial page. A Depth of 0 will load only the initial page. A Depth of 1 will load the initial page plus all pages linked from the initial page.

A specified Depth will reach the same set of pages that an imaginary visitor would reach using the same number of mouse clicks.
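
The effect of Depth can be pictured as a breadth-first walk over the link graph that stops expanding pages once they are Depth clicks away from the start. The sketch below works on a small hand-written link table; pagesWithinDepth and the page names are hypothetical and this is not how Spider itself is implemented.

```javascript
// links: object mapping a page to the pages it links to.
// Returns the pages reachable from startUrl in at most maxDepth clicks.
function pagesWithinDepth(links, startUrl, maxDepth) {
  var depthOf = new Map(); // page -> number of clicks from the start
  depthOf.set(startUrl, 0);
  var queue = [startUrl];

  while (queue.length > 0) {
    var page = queue.shift();
    var depth = depthOf.get(page);
    if (depth >= maxDepth) {
      continue; // do not follow links from pages at the depth limit
    }
    (links[page] || []).forEach(function (target) {
      if (!depthOf.has(target)) {
        depthOf.set(target, depth + 1);
        queue.push(target);
      }
    });
  }
  return Array.from(depthOf.keys());
}

var exampleLinks = {
  "home": ["about", "news"],
  "about": ["team"],
  "news": [],
  "team": ["jobs"]
};
console.log(pagesWithinDepth(exampleLinks, "home", 0));
console.log(pagesWithinDepth(exampleLinks, "home", 1));
console.log(pagesWithinDepth(exampleLinks, "home", 2));
```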

Page timeout

The number of seconds that Spider will wait for an external page to load before firing CSpider.mOnPageTimeout() and entering the paused state.

Page wait time

The number of seconds that Spider will wait after a page has displayed before beginning to load the next page. This can be used to allow the user time to visually inspect the page.

Wait for User Hook

Instead of waiting a specified number of seconds before loading the next page, "Wait for User Hook" will cause the spider to wait until the global variable gPageCompleted is set to true.
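
A minimal sketch of this handshake is shown below. The gPageCompleted declaration is only a stand-in so the example runs outside Spider, and runSlowCheck is a hypothetical placeholder for whatever asynchronous processing the hook performs.

```javascript
// Stand-in for the global that Spider provides; declared here only so
// that the sketch can run on its own.
var gPageCompleted = false;

// Hypothetical long-running check; calls done() when it has finished.
function runSlowCheck(done) {
  setTimeout(done, 10);
}

function userOnAfterPage() {
  // Processing is not finished yet, so Spider should keep waiting.
  gPageCompleted = false;
  runSlowCheck(function () {
    // Signal that the next page may be loaded.
    gPageCompleted = true;
  });
}
```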

"Wait for User Hook" is useful for situations where processing of the loaded page's content may take an arbitrary amount of time and where the User Hook functions need to control page transitions.

Autostart

Autostart will cause Spider to automatically begin loading pages when it first loads.

Autoquit

Autoquit will cause Spider to automatically exit when it has completed its run, i.e. reached the Stopped state.

Restrict URLs

If Restrict URLs is checked, then the spider will only follow links which contain the Domain. For example if you enter http://www.example.com/ as the initial URL, the spider will follow links of the form http://www.example.com/help/ but not http://www.foo.example.com/.

If you wish to restrict the spider to a domain, enter a partial domain such as example.com. The spider will then follow all links which contain example.com.

If you wish the spider to follow any link regardless of site or domain, uncheck Restrict URLs.
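
The rule can be read as a simple substring test on each link. The sketch below takes the Domain value as an explicit argument and does not model how Spider fills it in; shouldFollow is a hypothetical name and not part of Spider.

```javascript
// Decide whether a link would be followed under the Restrict URLs rule.
function shouldFollow(linkUrl, domain, restrictUrls) {
  if (!restrictUrls) {
    return true; // unrestricted: follow any link
  }
  return linkUrl.indexOf(domain) !== -1; // follow only links containing the Domain
}

console.log(shouldFollow("http://www.example.com/help/", "www.example.com", true));
console.log(shouldFollow("http://www.foo.example.com/", "www.example.com", true));
console.log(shouldFollow("http://www.foo.example.com/", "example.com", true));
```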

Respect robots.txt

If Respect robots.txt is checked, Spider will obey the rules specified in a site's robots.txt file and refuse to load blocked urls.

Allow file urls

If Allow file urls is checked, Spider will follow file:// urls. This option is not enabled by default because it has security implications: it allows Spider to follow file:// links from arbitrary pages on the web.

Starting in Spider 0.0.4.0, Spider will allow you to load a file url as the starting url, but will not follow file links unless you have specified Allow file urls.

This change has been made primarily to allow Spider to be used with local file based test cases and is not intended for use when spidering content on the wild wild web.

Debug spider

If Debug spider is checked, Spider will output debugging messages to the JavaScript Console and STDOUT. This is only useful to someone hacking on Spider.

Error Output

JavaScript

Errors

Select this option to send JavaScript Errors and Exceptions to stdout.

Warnings

Select this option to send JavaScript Warnings to stdout.

Chrome

Select this option to send Chrome Errors and Warnings to stdout.

XBL

Select this option to send XBL Errors and Warnings to stdout.

CSS Errors

Select this option to send CSS Errors and Warnings to stdout.

User Hooks

Script URL

Script URL is the location of an optional external JavaScript file which can be used to customize the operation of Spider through the use of any or all of the following functions. Note that the global object gSpider exposes the same interface as CSpider.

WARNING: These functions will operate in the chrome security context of the browser.

function userOnStart()
{
  // add custom code here
  // to be called by the Spider's mOnStart handler
}

function userOnBeforePage()
{
  // add custom code here
  // to be called by the Spider's mOnBeforePage handler
}

function userOnAfterPage()
{
  // add custom code here
  // to be called by the Spider's mOnAfterPage handler
  // this function is especially useful for performing
  // tests upon the DOM of a loaded web page.
  //
  // If "Wait for User Hook" is checked, then userOnAfterPage()
  // is responsible for setting the global variable gPageCompleted
  // in order to load the next page.
}

function userOnStop()
{
  // add custom code here
  // to be called by the Spider's mOnStop handler
}

function userOnPause()
{
  // add custom code here
  // to be called by the Spider's mOnPause handler
}

function userOnRestart()
{
  // add custom code here
  // to be called by the Spider's mOnRestart handler
}

function userOnPageTimeout()
{
  // add custom code here
  // to be called by the Spider's mOnPageTimeout handler
}

gConsoleListener.onConsoleMessage = 
function userOnConsoleMessage(s)
{
  // add custom code here to handle
  // the message which was sent to the
  // JavaScript Console and STDOUT.
  // You can use this function to
  // store messages in databases etc.
};
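
As one possible use of the console message hook, the sketch below keeps only messages that mention an error so they can be reported at the end of a run. The gConsoleListener fallback is a stand-in so the example runs outside Spider, and collectedErrors is a hypothetical variable name.

```javascript
// Stand-in so the sketch runs outside Spider, where gConsoleListener
// is normally provided by the extension.
if (typeof gConsoleListener === "undefined") {
  var gConsoleListener = {};
}

// Hypothetical store for messages of interest.
var collectedErrors = [];

gConsoleListener.onConsoleMessage =
function userOnConsoleMessage(s) {
  // Keep only messages whose text mentions an error.
  if (/error/i.test(s)) {
    collectedErrors.push(s);
  }
};
```

A userOnStop() hook could then write collectedErrors to whatever store is convenient.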

Spider 0.1.18 introduced a function loadScript(aScriptUrl[, aScope]) which can be used by user hook functions to load additional utility scripts. This allows the modularization and reuse of user hook code.

Controls

Run

Run will begin spidering the specified site.

Pause

Pause will cause the spider to enter the Paused state after it finishes loading the current page.

Restart

If the spider is Paused either because you have clicked Pause or the spider has timed out, you can press Restart to continue.

Stop

Stop will stop the spider.

Generate Spider URL

Generate Spider URL will open a new window with a link which can be used to open Spider and populate the Data inputs and optionally automatically Run Spider if Autostart is checked.

Reset

Reset will reset the Data inputs to their default values. Note that the URL containing any pre-existing query string value is not changed via Reset.

License

This software is licensed under the MPL, GPL and LGPL licenses. View source to see the license agreement and read mozilla.org's Mozilla & Netscape Public Licenses for more details.
