Mozilla Firefox based Spiders

Author Bob Clary

Web spiders have many possible applications, ranging from search indexes and web site quality assurance testing to testing browser implementations. This article introduces an improved Mozilla-based web spider application framework which can be adapted for many different uses.

What's New
Install Spider
CSpider.js - A New and Improved Spider Framework
Spider - A Mozilla Firefox Application based upon CSpider.js
Real World Applications of Spider
Links and Stuff

In Web Site Quality Assurance Testing using Mozilla, I introduced a Mozilla-based web spider application called CSpider for performing web site quality assurance. CSpider was based upon work I originally published on DevEdge. While the application did provide useful information for quality assurance testing, it had a number of limitations which became more and more apparent to me as I used it to test web sites for compatibility with Mozilla-based browsers. The limitations of the older versions include an inability to handle web pages which use "frame-busting" JavaScript to break out of HTML framesets, overly complicated and buggy logic for handling page loading, and a lack of clarity in how the user-defined spider "event handlers" could be used to extend the basic framework.

Spider eliminates the limitations of the earlier version by separating out the cross-browser JavaScript class CSpider and providing a basic framework for building spider applications in Mozilla as HTML, Remote XUL and Chrome applications.

CSpider.js - A New and Improved Spider Framework

The new version of the CSpider framework (manpage, JavaScript Source) incorporates a number of changes over the older versions. It includes a different state transition scheme which simplifies the handling of page loading, and it abstracts the actual means of loading new pages, thereby removing the dependency on a particular page loading implementation. The basic framework is written in a cross-browser fashion and could be used to create spidering applications for any reasonably competent browser.

Spider - A Mozilla Application based upon CSpider.js

In my recent work, I have been particularly interested in testing how well public web sites support the Mozilla family of browsers, as well as how well Mozilla holds up under real world content. To support this work, I extended the original version of the spidering application in several ways: it now supports customizable logging of errors in the JavaScript Console to STDOUT, accepts a "wait" time to allow visual inspection of loaded pages, can be extended through external scripts, and incorporates Dave Hyatt's Nifty Trick to force a page layout to occur.

The Spider Help page contains instructions on how to configure your Mozilla browser to give the Spider application the permissions it needs to access the internal APIs on which it depends. In order to run Spider as either an HTML application or a Remote XUL application, you need to enable signed.applets.codebase_principal_support to allow extended security privileges and browser.dom.window.dump.enabled to enable JavaScript dump() output to STDOUT.
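
One way to set both preferences is through a user.js file in the browser's profile directory. This is a sketch of that approach, which may differ from the steps given in Spider Help; the two preference names come from the paragraph above, and user_pref is the standard syntax for that file. Enabling codebase principal support relaxes the browser's security, so it is best confined to a profile used only for testing.

```javascript
// user.js fragment (placed in the profile directory, read at browser startup).
// Allows scripts to request extended security privileges.
user_pref("signed.applets.codebase_principal_support", true);
// Sends JavaScript dump() output to STDOUT.
user_pref("browser.dom.window.dump.enabled", true);
```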

Spider can be run as a Chrome Application. Spider is limited to operating in Mozilla-based browsers due to the use of internal Mozilla APIs to enable cross-domain access and the ability to capture error logging. You can review all of the source code by browsing the http://hg.mozilla.org/automation/sisyphus/file/tip/spider/ directory.

Note that when run as a Chrome application, Spider can prevent frame-busting code from replacing the top level browser window. When run as either an HTML or Remote XUL application, however, Spider can be stopped by frame-busting code.

Installing Spider

As a Chrome Application

To install Spider into Mozilla as a Chrome application, shift-click on this link to spider-0.1.0.5-an+fn+fx+sm+tb.xpi, download it to your local machine, then open the downloaded .xpi file in your copy of Mozilla. This will install Spider as a Chrome application. You will need to close and restart Mozilla before using Spider. You can invoke Spider as a Chrome application by entering chrome://spider/content in Mozilla's URL bar or from the command line as described in the Help documentation.

Extending Spider at run-time

Spider uses XMLHttpRequest to optionally load external scripts which can provide additional features without having to modify the source code. This is done by implementing "user hook" functions in an external script. See Spider Help for more details on implementing external "user hooks".

Real World Applications of Spider

This is all nice and stuff, but what use is it?

Web Site Quality Assurance

As I discussed in Web Site Quality Assurance Testing using Mozilla, Mozilla can be a powerful QA tool especially when combined with a spidering application such as Spider.

JavaScript and CSS Errors

I ran a special build of Mozilla (with CSS Parsing Error reporting enabled), with Popup Windows blocked, JavaScript Strict Warnings, Chrome and XBL errors enabled using a special "user hooks" script test-sites.js against a list (test-sites) of popular web site homepages. The "user hooks" script simply added a test to see if a page was using Standards or Quirks mode and to log the result. See Mozilla's quirks mode for more information about Quirks mode in Mozilla.
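
The Standards versus Quirks check can be sketched as follows. This is a minimal illustration rather than the contents of test-sites.js: it assumes the check reads document.compatMode, which reports "CSS1Compat" in Standards mode and "BackCompat" in Quirks mode, and the function names layoutMode and describeLayoutMode are hypothetical.

```javascript
// Hypothetical helpers, not taken from test-sites.js.
// Classifies a document by its compatMode property:
// "CSS1Compat" means Standards mode, anything else is treated as Quirks mode.
function layoutMode(doc) {
  return doc.compatMode === "CSS1Compat" ? "Standards" : "Quirks";
}

// Builds the line a hook script might log for one page.
function describeLayoutMode(url, doc) {
  return url + ": " + layoutMode(doc) + " Mode";
}
```

In a browser, calling layoutMode(document) classifies the current page; a hook script would pass the document of the page the spider has just loaded.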

The test results were very interesting. :-)

Test Results for Popular Sites
Result	Count
Standards Mode	9
Quirks Mode	61
JavaScript Errors	10
JavaScript Warnings	4002
CSS Errors	382
Chrome Errors	0
XBL Errors	0

Relatively few of these popular sites use Standards mode; most invoke Quirks mode, with its Navigator 4.x compatible layout bugs, instead. The use of DOCTYPE switching could definitely be improved on these sites.

The JavaScript errors included several due to blocked popup windows, which should be a concern for any site when Internet Explorer for Windows XP SP2 becomes available in the fall. Others included blocked access to the history object or outright errors.

The majority of JavaScript Warnings have to do with referencing undefined properties, which is typically not a problem. However, there are a number of other Warnings of which sites should be aware. These include redeclaration of variables, functions which do not always return a value, possible inappropriate use of assignment (=) instead of equality (==) in conditionals, and hiding of function arguments.
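
To make one of these warnings concrete, the sketch below shows the assignment-in-conditional pattern next to the comparison that was probably intended. The function names are hypothetical and the code is not taken from any of the tested sites.

```javascript
// Probable mistake: "=" assigns "admin" to role, so the condition is always
// truthy and the function returns true for every caller.
function isAdminBuggy(role) {
  if (role = "admin") {
    return true;
  }
  return false;
}

// Intended version: "==" compares the two values, and the single return
// statement means every code path returns a value.
function isAdmin(role) {
  return role == "admin";
}
```

The first version runs without raising an error, which is why this class of problem tends to surface only as a strict warning.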

The large number of CSS Errors is an indicator that the majority of web developers have difficulty coding valid CSS. If you see layout differences between Mozilla and MSIE, the first thing to do is to invoke Standards mode with the appropriate DOCTYPE and fix all possible CSS Errors.

Valid HTML

To illustrate using other web services from inside the "user hook" functions, I created a user hooks file, validatorhooks.js, which invokes the W3C Markup Validation Service for each page loaded and uses the number of msg elements returned in the XML output format to determine whether a page validated.
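
The pass/fail rule described above can be sketched as a small function. It assumes the validator's XML response has already been parsed into a DOM document (for example, the responseXML of an XMLHttpRequest); the function name is hypothetical and this is not the code from validatorhooks.js.

```javascript
// Hypothetical sketch of the rule: a page validates when the validator's
// XML response contains no msg elements.
// xmlDoc is a parsed DOM document, or null when the service returned nothing.
function validationResult(xmlDoc) {
  if (!xmlDoc) {
    return "no result";
  }
  var messageCount = xmlDoc.getElementsByTagName("msg").length;
  return messageCount === 0 ? "valid" : "invalid (" + messageCount + " messages)";
}
```

Treating a missing response as its own outcome keeps pages for which the service returned nothing separate from pages that are definitely invalid.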

Again, the results were very interesting. Of the 70 pages (not counting the test-sites page), 63 were definitely invalid, 6 did not return a result from the service and only 1 validated!

Plugins / Rich Media

I used objecthooks.js as an example of using user hook functions to detect and analyse the use of OBJECT, EMBED and SCRIPT FOR EVENT tags in web pages. The results provide another picture of how Spider combined with custom "user hooks" can be used to provide the answer to almost any question about a web site.
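
A tally of this kind can be sketched with standard DOM calls. The function name and the shape of the returned object are hypothetical, and this is not the code from objecthooks.js; "SCRIPT FOR EVENT" is taken here to mean SCRIPT elements that carry both a FOR and an EVENT attribute.

```javascript
// Hypothetical sketch, not the code from objecthooks.js.
// Counts OBJECT and EMBED elements, plus SCRIPT elements that carry both
// the FOR and EVENT attributes, in a DOM document.
function countRichMedia(doc) {
  var scripts = doc.getElementsByTagName("script");
  var scriptForEvent = 0;
  for (var i = 0; i < scripts.length; i++) {
    if (scripts[i].hasAttribute("for") && scripts[i].hasAttribute("event")) {
      scriptForEvent++;
    }
  }
  return {
    object: doc.getElementsByTagName("object").length,
    embed: doc.getElementsByTagName("embed").length,
    scriptForEvent: scriptForEvent
  };
}
```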

Simulating Mouseover and Mouseout Events

Simply loading a web page and logging the CSS and JavaScript errors which occur is useful, but it is limited in terms of the code paths which are actually executed. eventhooks.js improves the coverage of the code being tested by sending mouseover and mouseout events to each element on a page. This improves the overall testing of a web site by exercising any roll-over or DHTML menu code on the page. eventhooks.js uses the ability to "Wait for User Hooks" (new in version 0.0.0.4) to make the spider wait until all mouse events have been sent and processed.
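
The idea of sending roll-over events to every element can be sketched as below. This is an illustration under stated assumptions, not the code from eventhooks.js: the function name is hypothetical, and the event objects are supplied by a caller-provided factory so that the sketch does not depend on any one event-creation API.

```javascript
// Hypothetical sketch, not the code from eventhooks.js.
// Sends a mouseover followed by a mouseout to every element in doc and
// returns the number of events dispatched.
// createEvent(type) must return a fresh event object for each call.
function sendRolloverEvents(doc, createEvent) {
  // Copy the live collection first, because event handlers may add or
  // remove elements while the events are being dispatched.
  var elements = Array.prototype.slice.call(doc.getElementsByTagName("*"));
  var sent = 0;
  for (var i = 0; i < elements.length; i++) {
    elements[i].dispatchEvent(createEvent("mouseover"));
    elements[i].dispatchEvent(createEvent("mouseout"));
    sent += 2;
  }
  return sent;
}
```

In a current browser the factory could be function (type) { return new MouseEvent(type, { bubbles: true, cancelable: true }); }, so that handlers registered higher in the document also see the events.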

combinedhooks.js combines the Quirks vs. Standard mode detection, the OBJECT/EMBED detection and the eventhooks to provide a more complete analysis of a site.

Sending Log Somewhere Else

It is possible to capture the messages sent to the JavaScript Console and STDOUT using the gConsole.onConsoleMessage user hook function. consolehooks.js uses this new feature introduced in Spider 0.0.0.5 to send the messages to a new window. Note that you must enable popup windows on this site for this example to work.

Automated Test Runner

Another potential use for Spider is as an automated test runner. For example, consider a page containing a set of links to test pages, each of which sets a JavaScript variable to indicate success or failure. Spider could load each test page in turn and log the results to a file or a database using the appropriate "user hook" function.
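
A result check for such a test page might look like the sketch below. The variable name testPassed, the function name and the log format are all assumptions made for this example; a real hook would use whatever convention the test pages follow.

```javascript
// Hypothetical sketch: the variable name testPassed and the log format are
// assumptions made for this example.
// win is the window object of a loaded test page.
function testResultLine(url, win) {
  if (typeof win.testPassed === "undefined") {
    // The page never set the variable, so no verdict is available.
    return "UNKNOWN " + url;
  }
  return (win.testPassed ? "PASS " : "FAIL ") + url;
}
```

Reporting a missing variable separately distinguishes pages that failed from pages that never ran their test.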

Calling external scripts from user hook functions

Spider 0.0.1.18 introduced a function loadScript(aScriptUrl[, aScope]) which can be used by user hook functions to load additional utility scripts.

Links and Stuff

spider-0.1.0.5-an+fn+fx+sm+tb.xpi
What's New
Documentation
- Spider Help
- CSpider manpage
User Hooks
Related