Spider Help

About Spider
Special Considerations
- Security Privileges
- Preferences
Starting Spider
- Data
- Error Output
- User Hooks
- Controls
License

About Spider

Spider is a web application which can recursively visit the pages of a web site. It is customizable to perform arbitrary processing on the web content it visits and is especially useful in performing quality assurance tasks on web content. Spider is supported only by Mozilla-based browsers. To learn more about Spider, visit its home page or subscribe to http://bclary.com/home.rss where announcements regarding new versions are made.

This software is a modified version of the CSpider application hosted at http://devedge-temp.mozilla.org/toolbox/examples/2003/CSpider/.

Developers can learn more about the internal operation of the CSpider JavaScript Class which provides the basic framework for this application from CSpider JavaScript Class manpage

Special Considerations

Security Privileges

Spider is capable of capturing error messages and spidering sites from other domains through the use of extended security privileges.

If you are running Spider as a XUL application from Mozilla Chrome, the application automatically has the necessary security privileges. However if you are running Spider as an HTML application or a remote XUL application, you will need to grant the application the necessary privileges.

In order to configure Mozilla to enable the use of these security privileges, you must set the preference signed.applets.codebase_principal_support to true. You can do this either by using about:config to add the boolean preference, or modifying the user.js preferences file in your Mozilla profile directory to contain the line

user_pref("signed.applets.codebase_principal_support", true);

For more information, please see Bypassing Security Restrictions and Signing Code by Arun Ranganathan.

Preferences

Required Preferences

signed.applets.codebase_principal_support is required in order to allow Spider access to internal Mozilla APIs as well as cross-domain access. browser.dom.window.dump.enabled is required in order to allow Spider to send messages to STDOUT. Without dump, Spider can only report messages to the JavaScript Console.

user_pref("signed.applets.codebase_principal_support", true);
user_pref("browser.dom.window.dump.enabled", true);

Useful Preferences

javascript.options.strict enables Mozilla to report on JavaScript strict warnings which are very useful in determining potential problem areas in scripts. javascript.options.showInConsole can be used to report internal browser errors to the JavaScript console and is useful for Mozilla developers.

user_pref("javascript.options.strict", true);
user_pref("javascript.options.showInConsole", true);

Spider can be halted through alerts which either Mozilla sends or which are created by the web pages being visited. If you wish to run Spider unattended, you should disable as many alert messages as possible. In addition, by using browser.xul.error_pages.enabled to replace network error alerts with error pages, Spider can easily detect and report on network errors.

user_pref("browser.xul.error_pages.enabled", true);
user_pref("dom.disable_open_during_load", true);
user_pref("dom.disable_window_flip", true);
user_pref("dom.disable_move_resize", true);
user_pref("dom.disable_window_open_feature.status", true);
user_pref("security.warn_entering_secure", false);
user_pref("security.warn_entering_weak", false);
user_pref("security.warn_submit_insecure", false);
user_pref("security.warn_leaving_secure", false);
user_pref("security.warn_viewing_mixed", false);
user_pref("capability.policy.default.Window.alert", "noAccess");
user_pref("capability.policy.default.Window.confirm", "noAccess");
user_pref("capability.policy.default.Window.prompt", "noAccess");

If you wish to enable alerts on specific sites, you can enable them on a per site basis using a policy and a site list for which you will allow alerts.

user_pref("capability.policy.policynames", "trusted");
user_pref("capability.policy.trusted.sites", "...");
user_pref("capability.policy.trusted.Window.alert", "sameOrigin");
user_pref("capability.policy.trusted.Window.confirm", "sameOrigin");
user_pref("capability.policy.trusted.Window.prompt", "sameOrigin");

You can download user.js which contains these preferences and save it to your Mozilla profile directory.

If you need to Spider sites which require authentication, the following preferences can be useful to automatically negotiate authentication credentials for specific sites and domains.

 // basic authentication
 user_pref("network.negotiate-auth.trusted-uris", site-list);
 user_pref("network.negotiate-auth.delegation-uris", site-list);

 // ntlm authentication
  user_pref("network.automatic-ntlm-auth.allow-proxies", true);
  user_pref("network.automatic-ntlm-auth.trusted-uris", site-list);

 // Confirm user intent whenever URL of the form 
 // http://user:pass@my.site.com is accessed?
 user_pref("network.http.defensive-auth-prompting", false);

where site-list is a string containing a comma-delimited list of protocols, sites or domains. Be careful not to unintentially broaden the scope of sites where authentication negotiation is used since it might expose your credentials to crackers.

In order for automatic negotiation to work, you will need to be logged into your local machine using the same credentials that are used on the remote site. You can find more information about these preferences in Nigel McFarlane's Firefox Hacks (online).

Starting Spider

Spider can be run as either an HTML web application, a remote XUL web application or as a chrome XUL application. When run as an HTML application, it uses an IFRAME to load pages which sends a null HTTP referer. When run as a XUL application, Spider uses a xul:browser element along with the method loadURI to send the url of the page containing a link as its referer.

Spider sends logging output to both the JavaScript console and to STDOUT if the dump function is enabled. Since the JavaScript console is limited in the number of messages it may contain, it is important to also send the logging messages to STDOUT. This allows the results of a Spider run to be collected in a text file for later review and processing. In order to see this output you should start Mozilla from the command line and redirect STDOUT to a text file.

The following command will start Mozilla, automatically select the profile test, load Spider and direct all output to the file test.log.

HTML

mozilla -P test http://bclary.com/2004/07/10/spider/chrome/content/spider/spider.html > test.log 2>&1

Remote XUL

mozilla -P test http://bclary.com/2004/07/10/spider/chrome/content/spider/spider.xul > test.log 2>&1

Chrome XUL

Spider can be run directly as in:

mozilla -P test -chrome chrome://spider/content > test.log 2>&1

However, this can have an undesired side-effect of creating a window which can not be resized in recent Firefox trunk (Deer Park alpha) builds based on Gecko 1.8.

To create a resizable window, invoke open.xul as in:


      
      mozilla -P test -chrome chrome://spider/content/open.xul  > test.log 2>&1
      
      open.xul will create a resizable window using
      javascript, then load spider.xul while passing any
      querystring along to spider.xul.
      
    

    Note that when run as a Chrome application, Spider can prevent
    frame-busting code from replacing the top level browser window however when
    run as either an HTML or Remote XUL application, Spider can be stopped by
    frame-busting code.



    Data
    
    URL
    
    
    URL is the location which you wish to spider. This can be a
    fully qualified domain such as http://www.example.com/ or
    a partial domain such as example.com.
    
    Domain
    
    
    Domain is used to restrict Spider to follow links on a specific domain.
    If this value is not specified, it will be generated from the URL by
    removing the protocol and any leading www.
    
    
    Domain is useful when you wish to start at a given URL but do not
    wish to restrict Spider to URLs which contain the initial URL.
    
    
    Depth

    

    
    Depth is the number of links to follow during the spider. 
    0 will load only the initial page. 1 will load the initial page plus
    all pages linked from the initial page.
    
    
    A specified Depth will reach the same set of pages that an 
    imaginary visitor would reach using the same number of mouse 
    clicks.
    
    

    Page timeout
    
    
    The number of seconds that Spider will wait for an external page to 
    load before firing CSpider.mOnPageTimeout() and entering the 
    paused state.
    
    

    Page wait time
    
    
    The number of seconds that Spider will wait after a page has
    displayed before beginning to load the next page. This can be
    use to allow the user time to visually inspect the page.
    
    
    Wait for User Hook
    
    
    Instead of waiting a specified number of seconds before loading the
    next page, "Wait for User Hook" will cause the spider to wait until
    the global variable gPageCompleted is set to true.
    
    
    "Wait for User Hook" is useful for situations where processing of the
    loaded page's content may take an arbitrary amount of time and where
    the User Hook functions need to control page transitions.
    
    
    Autostart
    
    
    If Spider is invoked with a query string containing autostart=on, Spider
    will automatically begin executing with the data values specified
    in the query string.
    
    
    To generate a link containing the desired parameters, first 
    set the data inputs, check the Autostart check box,
    then click Generate Spider URL.
    
    Autoquit
    
    
    If Spider is invoked with a query string containing autoquit=on, Spider
    will automatically exit when it has completed its run, i.e. reached 
    the Stopped state.
    
    
    To generate a link containing the desired parameters, first 
    set the data inputs, check the Autoquit check box,
    then click Generate Spider URL/code>.
    

    Restrict URLs
    
    
    If Restrict URLs is checked, then the spider will only follow links
    which contain the Domain. For example if you enter 
    http://www.example.com/ as the initial URL, 
    the spider will follow links of the form 
    http://www.example.com/help/ but not 
    http://www.foo.example.com/. 
    
    
    If you wish to restrict the spider to a domain, simply enter the
    partial domain such as example.com which will follow
    all links which contain example.com.
    
    
    If you wish the spider to follow any link regardless of site or domain,
    uncheck Restrict URLs.
    
    
    
    Respect robots.txt
    
    
    If Respect robots.txt is checked, Spider will obey the rules specified
    in a site's robots.txt file and refuse to load blocked urls. 
    
    

    Debug spider
    
    
    If Debug spider is checked, Spider will output debugging messages to
    the JavaScript Console and STDOUT. Only useful to someone hacking 
    Spider.
    
    
    


    Error Output
    
    JavaScript
    
    
    Errors
    
    
    Select this option to send JavaScript Errors and Exceptions to stdout.
    
    
    Warnings
    
    
    Select this option to send JavaScript Warnings to stdout.
    
    
    Chrome
    
    
    Select this option to send Chrome Errors and  Warnings to stdout.
    
    
    XBL
    
    
    Select this option to send XBL Errors and  Warnings to stdout.
    
    
    
    
    CSS Errors
    
    
    Select this option to send CSS Errors and  Warnings to stdout.
    
    
    

    User Hooks

    
      Script URL
      
      

      Script URL is the location of an optional external JavaScript file which
      can be used to customize the operation of Spider through the use of any
      or all of the following functions.
      
      
      
      
      
      WARNING: These functions will operate
        in the chrome security context of the browser.

      

      function userOnStart()
{
  // add custom code here
  // to be called by the Spider's mOnStart handler
}

function userOnBeforePage()
{
  // add custom code here
  // to be called by the Spider's mOnBeforePage handler
}

function userOnAfterPage()
{
  // add custom code here
  // to be called by the Spider's mOnAfterPage handler
  // this function is especially useful for performing
  // tests upon the DOM of a loaded web page.
  //
  // If "Wait for User Hook" is checked, then userOnAfterPage()
  // is responsible for setting the global variable gPageCompleted
  // in order to load the next page.
}

function userOnStop()
{
  // add custom code here
  // to be called by the Spider's mOnStop handler
}

function userOnPause()
{
  // add custom code here
  // to be called by the Spider's mOnPause handler
}

function userOnRestart()
{
  // add custom code here
  // to be called by the Spider's mOnRestart handler
}

function userOnPageTimeout()
{
  // add custom code here
  // to be called by the Spider's mOnPageTimeout handler
}

gConsoleListener.onConsoleMessage = 
function userOnConsoleMessage(s)
{
  // add custom code here to handle
  // the message which was sent to the
  // JavaScript Console and STDOUT.
  // You can use this function to
  // store messages in databases etc.
};


      
    


    Controls
    
      Run
      
      
      Run will begin spidering the specified site.
      
      
      Pause
      
      
      Pause will cause the spider to enter the 
      Paused state after it finishes loading the current
      page. 
      
      
      Restart
      
      
      If the spider is Paused either because you 
      have clicked Pause or the spider has 
      timed out, you can press Restart
      to continue.
      

      

      Stop
      
      
      Stop will stop the spider.
      
      
      Generate Spider URL
      
      
      Generate Spider URL will open a new window
      with a link which can be used to open Spider and populate the Data
      inputs and optionally automatically Run
      Spider if Autostart is checked.
      
      
      Reset
      
      
      Reset will reset the Data inputs to their
      default values. Note that the URL containing any pre-existing
      query string value is not changed via Reset.
      
      
    


    License

    

    This software is licensed under the MPL, GPL and LGPL licenses.
    View source to see the license agreement and read mozilla.org's 
    Mozilla & Netscape Public Licenses 
    for more details.

    

    spider