Mozilla Firefox based Spiders
Web spiders have many possible uses, ranging from building search indexes and web site quality assurance testing to testing browser implementations. This article introduces an improved Mozilla-based web spider application framework which can be adapted for many different uses.
- What's New
- Install Spider
- CSpider.js - A New and Improved Spider Framework
- Spider - A Mozilla Firefox Application based upon CSpider.js
- Real World Applications of Spider
- Links and Stuff
In Web Site Quality Assurance Testing using Mozilla, I introduced the use of a Mozilla-based web spider application called CSpider to perform web site quality assurance. CSpider was based upon work I originally published on DevEdge. While the application did provide useful information for quality assurance testing, it had a number of limitations which became more and more apparent to me as I used it to test web sites for compatibility with Mozilla-based browsers. The limitations of the older versions include an inability to handle web pages which use "frame-busting" JavaScript to break out of HTML framesets, overly complicated and buggy logic for handling page loading, and a lack of clarity in how the user-defined spider "event handlers" could be used to extend the basic framework.
Spider eliminates the limitations of the earlier version by separating out the cross-browser JavaScript class CSpider and providing a basic framework for implementing spider applications in Mozilla as HTML, Remote XUL and Chrome applications.
CSpider.js - A New and Improved Spider Framework
The new version of the CSpider framework (manpage, JavaScript Source) incorporates a number of changes over the older versions. It includes a different 'state' transition scheme which simplifies the handling of page loading and abstracts the actual means of loading new pages, thereby removing its dependency on a particular page loading implementation. The basic framework is written in a cross-browser fashion and could be used to create spidering applications for any reasonably competent browser.
Spider - A Mozilla Application based upon CSpider.js
In my recent work, I have been particularly interested in testing how well public web sites support the Mozilla family of browsers, as well as how well Mozilla holds up under real world content. To support this work, I extended the original version of the spidering application in several ways: it can now log errors from the JavaScript Console to STDOUT in a customizable way, accept a "wait" time to allow visual inspection of loaded pages, and be extended through the use of external scripts. It also incorporates Dave Hyatt's Nifty Trick to force a page layout to occur.
The Spider Help page contains instructions on how to configure your Mozilla browser to allow the Spider application the necessary permissions to access the internal APIs necessary for it to operate. In order to run Spider as either an HTML application or Remote XUL application, you need to enable signed.applets.codebase_principal_support to allow extended security privileges and browser.dom.window.dump.enabled to enable JavaScript dump() output to STDOUT.
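As a sketch, the two preferences could be set by adding lines like the following to the user.js file of the Mozilla profile used for testing (they can also be toggled through about:config). This assumes a browser version that still honors both preferences, and a profile kept for testing only, since enabling codebase principals weakens the browser's security.

```javascript
// user.js in the profile directory used for testing (assumed location).

// Allow scripts to request extended security privileges.
user_pref("signed.applets.codebase_principal_support", true);

// Send JavaScript dump() output to STDOUT.
user_pref("browser.dom.window.dump.enabled", true);
```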
Spider can also be run as a Chrome application. Spider is limited to operating in Mozilla-based browsers due to the use of internal Mozilla APIs to enable cross-domain access and the ability to capture error logging. You can review all of the source code by browsing the http://hg.mozilla.org/automation/sisyphus/file/tip/spider/ directory.
Note that when run as a Chrome application, Spider can prevent frame-busting code from replacing the top level browser window. When run as either an HTML or Remote XUL application, however, Spider can be stopped by frame-busting code.
Installing Spider
As a Chrome Application
To install Spider into Mozilla as a Chrome application, shift-click on this link to spider-0.1.0.5-an+fn+fx+sm+tb.xpi, download it to your local machine, then open the spider-0.1.0.5-an+fn+fx+sm+tb.xpi file in your copy of Mozilla. This will install Spider as a Chrome application in Mozilla. You will need to close and restart Mozilla before using Spider. You can invoke Spider as a Chrome application by entering chrome://spider/content in Mozilla's URL bar or from the command line as described in the Help documentation.
Extending Spider at run-time
Spider uses XMLHttpRequest to optionally load external scripts which can provide additional features without having to modify the source code. This is done by implementing "user hook" functions in an external script. See Spider Help for more details on implementing external "user hooks".
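The general idea of a user hook can be illustrated with a small sketch. This is not Spider's actual implementation, and the names callUserHook, userOnAfterPage and userOnStop are hypothetical; the real hook names are listed in Spider Help. The sketch only shows a framework calling an optional function if the external script has defined it.

```javascript
// Hypothetical helper: invoke an optional hook if the loaded script defined it.
function callUserHook(scope, hookName, args) {
  var hook = scope[hookName];
  if (typeof hook !== 'function') {
    return false; // no hook supplied, the framework carries on as usual
  }
  hook.apply(scope, args || []);
  return true;
}

// An external script only needs to define the functions it cares about.
var hooks = {
  visited: [],
  userOnAfterPage: function (url) { // hypothetical hook name
    this.visited.push(url);
  }
};

callUserHook(hooks, 'userOnAfterPage', ['http://example.com/']);
callUserHook(hooks, 'userOnStop', []); // not defined, so it is skipped
```

Checking for the function before calling it is what lets an external script supply only some of the hooks without the framework failing on the missing ones.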
Real World Applications of Spider
This is all nice and stuff, but what use is it?
Web Site Quality Assurance
As I discussed in Web Site Quality Assurance Testing using Mozilla, Mozilla can be a powerful QA tool especially when combined with a spidering application such as Spider.
JavaScript and CSS Errors
I ran a special build of Mozilla (with CSS Parsing Error reporting enabled), with Popup Windows blocked, JavaScript Strict Warnings, Chrome and XBL errors enabled using a special "user hooks" script test-sites.js against a list (test-sites) of popular web site homepages. The "user hooks" script simply added a test to see if a page was using Standards or Quirks mode and to log the result. See Mozilla's quirks mode for more information about Quirks mode in Mozilla.
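A check of this kind can be sketched with the standard document.compatMode property, which is "CSS1Compat" in Standards (and almost-standards) mode and "BackCompat" in Quirks mode. This is an illustration rather than the contents of test-sites.js; the function names and the stand-in document objects are assumptions made so the sketch also runs outside a browser.

```javascript
// Classify a document's rendering mode from document.compatMode.
function renderingMode(doc) {
  return doc.compatMode === 'CSS1Compat' ? 'Standards Mode' : 'Quirks Mode';
}

// Tally the modes seen across a set of loaded documents.
function tallyModes(docs) {
  var counts = { 'Standards Mode': 0, 'Quirks Mode': 0 };
  docs.forEach(function (doc) {
    counts[renderingMode(doc)]++;
  });
  return counts;
}

// Stand-in objects; in a browser these would be real document objects.
var counts = tallyModes([
  { compatMode: 'BackCompat' },
  { compatMode: 'CSS1Compat' },
  { compatMode: 'BackCompat' }
]);
```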
The test results were very interesting. :-)
Result | Count |
---|---|
Standards Mode | 9 |
Quirks Mode | 61 |
JavaScript Errors | 10 |
JavaScript Warnings | 4002 |
CSS Errors | 382 |
Chrome Errors | 0 |
XBL Errors | 0 |
Relatively few of these popular sites use Standards mode; most invoke Quirks mode, with its Navigator 4.x compatible layout bugs, instead. The proper use of DOCTYPE switching could definitely be improved on these sites.
The JavaScript errors included several due to blocked popup windows which should be a concern for any site when Internet Explorer for Windows XP SP2 becomes available in the fall. Others included blocked access to the history object or outright errors.
The majority of JavaScript Warnings have to do with referencing undefined properties, which is typically not a problem. However, there are a number of other Warnings of which sites should be aware. These include redeclaration of variables, functions which do not always return a value, possible inappropriate use of assignment (=) instead of equality (==) in conditionals, and hiding of function arguments.
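The warning categories listed above can be illustrated with a few small functions. Each one runs without raising an error, which is why these patterns tend to go unnoticed; the function names are made up for the example.

```javascript
// 1. Redeclaration of a variable.
function redeclare() {
  var total = 1;
  var total = 2; // second declaration of the same name
  return total;
}

// 2. A function which does not always return a value.
function sign(n) {
  if (n > 0) {
    return 1;
  }
  // falls through and returns undefined when n <= 0
}

// 3. Assignment (=) where equality (==) was probably intended.
function isZero(n) {
  if (n = 0) { // assigns 0 to n, so the condition is always false
    return true;
  }
  return false;
}

// 4. A local variable hiding a function argument.
function greet(name) {
  var name = 'nobody'; // hides the value passed by the caller
  return 'hello ' + name;
}
```

The third case is the most likely to be a real bug: isZero never returns true, whatever value is passed in.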
The large number of CSS Errors is an indicator that the majority of web developers have difficulty coding valid CSS. If you see layout differences between Mozilla and MSIE, the first thing to do is to invoke Standards mode with the appropriate DOCTYPE and fix all possible CSS Errors.
Valid HTML
To illustrate using other web services from inside of the "user hook" functions, I created the validatorhooks.js user hooks file which invokes the W3C Markup Validation Service for each page loaded and uses the number of msg elements returned in the XML output format to determine if a page validated.
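The counting step might look something like the sketch below. It is not the code from validatorhooks.js: the function names are hypothetical, and it assumes that each problem reported by the service appears as one msg element, so that a response with no msg elements means the page validated. A stand-in object replaces the XML document (for example XMLHttpRequest's responseXML) so the sketch can run outside a browser.

```javascript
// Summarize a validator response from its parsed XML document.
// Assumption: one <msg> element per reported problem.
function validationSummary(xmlDoc) {
  if (!xmlDoc) {
    // The service did not return a usable response.
    return { status: 'no result', messages: 0 };
  }
  var count = xmlDoc.getElementsByTagName('msg').length;
  return { status: count === 0 ? 'valid' : 'invalid', messages: count };
}

// Stand-in for a parsed XML document containing messageCount <msg> elements.
function fakeXml(messageCount) {
  return {
    getElementsByTagName: function (tag) {
      return tag === 'msg' ? new Array(messageCount) : [];
    }
  };
}

var summary = validationSummary(fakeXml(3));
```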
Again the results were very interesting. Of the 70 pages (not counting the test-sites page), 63 were definitely invalid, 6 did not return a result from the service and only 1 validated!
Plugins / Rich Media
I used objecthooks.js as an example of using user hook functions to detect and analyse the use of OBJECT, EMBED and SCRIPT FOR EVENT tags in web pages. The results provide another picture of how Spider combined with custom "user hooks" can be used to provide the answer to almost any question about a web site.
Simulating Mouseover and Mouseout Events
Simply loading a web page and logging the CSS and JavaScript errors which occur is useful; however, it is limited in terms of the code paths which are actually executed. eventhooks.js improves the coverage of the code being tested by sending mouseover and mouseout events to each element on a page. This improves the overall testing of a web site by exercising any roll-over or DHTML menu code on the page. eventhooks.js uses the new (in version 0.0.0.4) ability to "Wait for User Hooks" in order to allow the spider to wait until all mouse events have been sent and processed.
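The event-sending step can be sketched with the DOM calls createEvent, initEvent and dispatchEvent. This is a simplified illustration and not the code from eventhooks.js; the function name sendMouseEvents is hypothetical, and the sketch dispatches everything synchronously rather than using the "Wait for User Hooks" mechanism.

```javascript
// Dispatch a mouseover and a mouseout event to every element in a document.
// Returns the number of events sent.
function sendMouseEvents(doc) {
  var elements = doc.getElementsByTagName('*');
  var types = ['mouseover', 'mouseout'];
  var sent = 0;
  for (var i = 0; i < elements.length; i++) {
    for (var j = 0; j < types.length; j++) {
      var evt = doc.createEvent('MouseEvents');
      evt.initEvent(types[j], true, true); // bubbles, cancelable
      elements[i].dispatchEvent(evt);
      sent++;
    }
  }
  return sent;
}

// In a loaded page this would be called as: sendMouseEvents(document);
```

Any JavaScript or CSS errors triggered by the page's own mouseover and mouseout handlers would then show up in the same log as the errors raised while loading the page.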
combinedhooks.js combines the Quirks vs. Standard mode detection, the OBJECT/EMBED detection and the eventhooks to provide a more complete analysis of a site.
Sending Log Somewhere Else
It is possible to capture the messages sent to the JavaScript Console and STDOUT using the gConsole.onConsoleMessage user hook function. consolehooks.js uses this new feature introduced in Spider 0.0.0.5 to send the messages to a new window. Note that you must enable popup windows on this site for this example to work.
Automated Test Runner
Another potential use for Spider is as an automated test runner. For example, a page containing a set of links to test pages, each of which sets a JavaScript variable to indicate success or failure, could be used in combination with Spider to automatically test each page and log the results to a file or a database using the appropriate "user hook" function.
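As a rough sketch of the reporting side, suppose every test page sets a global variable named testPassed to true or false. That variable name and the function reportTestResult are hypothetical conventions chosen for this example, not part of Spider. A user hook could then turn each loaded page into one log line:

```javascript
// Hypothetical convention: each test page sets a global "testPassed"
// to true or false before it finishes loading.
function reportTestResult(url, pageWindow) {
  var outcome;
  if (typeof pageWindow.testPassed === 'undefined') {
    outcome = 'NO RESULT'; // the page never set the variable
  } else {
    outcome = pageWindow.testPassed ? 'PASS' : 'FAIL';
  }
  return outcome + ' ' + url;
}

// Stand-in window objects; a hook would pass the loaded page's window.
var lines = [
  reportTestResult('test1.html', { testPassed: true }),
  reportTestResult('test2.html', { testPassed: false }),
  reportTestResult('test3.html', {})
];
```

Treating a missing variable as its own outcome keeps pages that failed to load or crashed early from being counted as passes or failures.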
Calling external scripts from user hook functions
Spider 0.0.1.18 introduced a function loadScript(aScriptUrl[, aScope]) which can be used by user hook functions to load additional utility scripts.
Links and Stuff
- spider-0.1.0.5-an+fn+fx+sm+tb.xpi
- What's New
- Documentation
- User Hooks
- Related