CSpider

NAME

CSpider - a JavaScript Class for Spidering Web pages

SYNOPSIS





DESCRIPTION

CSpider is a JavaScript class which is intended to be used as a framework upon which web spidering applications can be built. Through the use of the user overridable methods, CSpider's behavior can be customized to meet the needs of many types of applications.

CSpider uses a helper class to abstract and manage requests to load external documents. The helper class must implement the following methods:

Constructor

Constructs an instance of CSpider which will begin spidering at the location aUrl, up to a depth of aDepth, waiting aOnLoadTimeoutInterval seconds for each page to load before timing out. If aRestrictUrl is true, only pages whose URLs include the string aDomain are visited. aPageLoader is a special helper object used to manage loading external pages. aExtraPrivileges can be set to true in Gecko in order to use netscape.security.PrivilegeManager to request the additional security privileges needed for cross-domain access to a loaded page's DOM.

If aRespectRobotRules is true, CSpider will load robots.txt from each domain it visits using isRobotBlocked.js and will not add pages to its pending list if they are blocked by the robot rules.

aUserAgent is currently passed to isRobotBlocked() but is not (yet) used to set the user agent of the client. If not passed as an argument, aUserAgent is assumed to be 'Gecko/'.

Properties

mUrl

String containing the initial URL where CSpider will begin. If the initial value obtained from aUrl in the constructor does not begin with 'http://' or 'https://' or 'file://', 'http://' will be prepended to aUrl before it is saved in mUrl.
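The prefixing rule can be sketched as follows. This is a minimal illustration rather than CSpider's actual code; normalizeStartUrl is a hypothetical helper name, and the case-sensitive match is an assumption.

```javascript
// Hypothetical helper illustrating the rule described for mUrl.
function normalizeStartUrl(aUrl) {
  // Leave the URL alone if it already begins with a recognized scheme.
  if (/^(https?|file):\/\//.test(aUrl)) {
    return aUrl;
  }
  // Otherwise assume plain HTTP.
  return 'http://' + aUrl;
}
```

Under this sketch, 'example.com' becomes 'http://example.com', while URLs that already start with 'https://' or 'file://' pass through unchanged.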

mDomain

String which will be used (if mRestrictUrl is true) to filter links in loaded documents. If aDomain is not specified, it defaults to the value of aUrl with any leading protocol ('http://' or 'https://') removed.

mRestrictUrl

Boolean which, if true, causes CSpider to follow only links which contain the String mDomain.
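The interaction between mDomain and mRestrictUrl might look like the sketch below. Both function names are hypothetical and are not part of CSpider; the sketch only assumes the substring test and the protocol stripping described above.

```javascript
// Hypothetical helpers; these names are not part of CSpider.
function defaultDomain(aUrl) {
  // Strip a leading 'http://' or 'https://' as described for mDomain.
  return aUrl.replace(/^https?:\/\//, '');
}

function isLinkAllowed(aLink, aDomain, aRestrictUrl) {
  // With restriction off, every link passes. Otherwise the link
  // must contain the domain string somewhere in it.
  return !aRestrictUrl || aLink.indexOf(aDomain) !== -1;
}
```

Because the test is a substring match, a value such as 'example.com/docs/' restricts the spider to one directory of a site, and any URL that merely contains the string elsewhere (for example in a query string) would also pass.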

mDepth

Number indicating the depth (number of clicks) to which CSpider will follow links.

mCallWrapperOnLoadPageTimeout

An instance of CCallWrapper which is used to detect timeout conditions when loading pages.

mOnLoadTimeoutInterval

Number (in milliseconds) that CSpider will wait for a page to load before calling mCallWrapperOnLoadPageTimeout.

mExtraPrivileges

Boolean which, if true, causes CSpider to invoke netscape.security.PrivilegeManager.enablePrivilege to enable cross-domain spidering.

mRespectRobotRules

Boolean which determines whether CSpider will load a site's robots.txt file and check the URLs of pages to be loaded against it using isRobotBlocked().

mUserAgent

String which is passed to isRobotBlocked() to check the robot rules. It currently defaults to 'Gecko/'.

mPagesVisited

An Array of Strings of the URLs visited by CSpider.

mPagesPending

An Array of CUrl instances representing the pages to be visited.

mPagesHash

A hash of URL strings which are used to determine pages which have already been visited and which do not need to be revisited.
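One way such a hash can prevent revisits is sketched below. The variables are stand-ins for mPagesHash and mPagesPending, enqueueOnce is a hypothetical name, and the exact point at which CSpider records a URL in the hash is an assumption.

```javascript
// Hypothetical stand-ins for mPagesHash and mPagesPending.
const pagesHash = Object.create(null); // no inherited keys to collide with URLs
const pagesPending = [];

function enqueueOnce(aUrl, aDepth) {
  if (pagesHash[aUrl]) {
    return false; // already seen, do not queue it again
  }
  pagesHash[aUrl] = true;
  pagesPending.push({ mUrl: aUrl, mDepth: aDepth });
  return true;
}
```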

mState

A String indicating the state of the spider. One of 'ready', 'running', 'pausing', 'paused', 'timeout', 'stopping', 'stopped'.

mCurrentUrl

An instance of CUrl representing the page currently being loaded. CUrl is an internal class used to keep track of information related to the currently loaded url.

Constructor

Properties

mDepth

the depth of the URL

mUrl

a String containing the actual URL

mReferer

referer for the request to load this url

mResponses

an array of objects containing captured HTTP response data from loading the url. Note that mResponses[0] contains the responses for the page mUrl while mResponses[i] for i > 0 are the responses for the scripts, CSS files, images, etc loaded by the document at mUrl.

Each object in mResponses[] contains data from the nsIHttpChannel interface in the following properties:

originalURI

The original uri requested

URI

The actual uri returned

referrer

The referrer for the URI

responseStatus

The HTTP status code for the response

responseStatusText

Lower case status in text form, e.g. ok, moved, etc.

contentType

The content type of the document returned

requestSucceeded

Boolean signifying if the request was successful.
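As an illustration of how these properties might be consumed, the sketch below filters a response list for failures and detects redirects. The sample data is invented, the helper names are hypothetical, and it assumes the URI properties are held as plain strings.

```javascript
// Invented sample data shaped like the mResponses entries described above.
const sampleResponses = [
  {
    originalURI: 'http://example.com/', URI: 'http://example.com/index.html',
    referrer: '', responseStatus: 200, responseStatusText: 'ok',
    contentType: 'text/html', requestSucceeded: true
  },
  {
    originalURI: 'http://example.com/missing.css', URI: 'http://example.com/missing.css',
    referrer: 'http://example.com/', responseStatus: 404, responseStatusText: 'not found',
    contentType: 'text/html', requestSucceeded: false
  }
];

// Hypothetical helper: entries whose request did not succeed.
function failedResponses(aResponses) {
  return aResponses.filter(function (aResponse) {
    return !aResponse.requestSucceeded;
  });
}

// Hypothetical helper: true when the returned URI differs from the one requested.
function wasRedirected(aResponse) {
  return aResponse.originalURI !== aResponse.URI;
}
```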

mDocument

A reference to the currently loaded Document. This property is set to null in loadPage() and set to the currently loaded document by onLoadPage().

Controller Methods

These methods are used to control the spider.

run

run() begins spidering at mUrl. run() enters the 'running' state, then calls the user specified mOnStart() before loading any pages. mOnStart controls the execution of the spider by returning true to begin loading pages or false to prevent loading pages and re-enter the 'ready' state.

restart

If the spider is in either the 'paused' or 'timeout' states, restart() will enter the 'running' state, then call the user specified mOnRestart() before loading any pages. mOnRestart controls the execution of the spider by returning true to begin loading pages or false to prevent loading pages and enter the 'paused' state.

pause

If mCallWrapperOnLoadPageTimeout is not null, pause() enters the 'pausing' state and then returns to allow either onLoadPage or onLoadPageTimeout to complete before entering the 'paused' state. If mCallWrapperOnLoadPageTimeout is null, pause() enters the 'paused' state then calls mOnPause() to control the execution of the spider by returning true to stay in the paused state or false to enter the 'running' state and call loadPage() to continue spidering.

stop

If mCallWrapperOnLoadPageTimeout is not null, stop() enters the 'stopping' state and then returns to allow either onLoadPage or onLoadPageTimeout to complete before entering the 'stopped' state. If mCallWrapperOnLoadPageTimeout is null, stop() enters the 'stopped' state, then calls mOnStop() to control the execution of the spider by returning true to stay in the stopped state or false to enter the 'running' state and call loadPage() to continue spidering.
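The state transitions described for run() and pause() can be sketched with a simplified stand-in object. This is not the real CSpider: makeSketchSpider and loadCount are hypothetical, loadPage() is a placeholder, and stop() and restart() are omitted because they follow the same pattern.

```javascript
// Simplified stand-in for illustration only; this is not the real CSpider.
function makeSketchSpider(aHooks) {
  return {
    mState: 'ready',
    mCallWrapperOnLoadPageTimeout: null,
    loadCount: 0, // counts loadPage() calls so the sketch is observable
    mOnStart: aHooks.mOnStart,
    mOnPause: aHooks.mOnPause,
    loadPage() { this.loadCount += 1; }, // placeholder for the real page load
    run() {
      this.mState = 'running';
      if (this.mOnStart()) {
        this.loadPage();
      } else {
        this.mState = 'ready'; // the hook vetoed the run
      }
    },
    pause() {
      if (this.mCallWrapperOnLoadPageTimeout !== null) {
        // A load is in flight; let onLoadPage or onLoadPageTimeout finish it.
        this.mState = 'pausing';
        return;
      }
      this.mState = 'paused';
      if (!this.mOnPause()) {
        this.mState = 'running'; // the hook declined the pause
        this.loadPage();
      }
    }
  };
}
```

The two-step 'pausing' then 'paused' sequence exists so that a page load already in progress is never interrupted halfway; the second call to pause() comes from the load or timeout handler once the pending load has been resolved.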

Internal Methods

These methods are used internally by CSpider.

init

init() is a private method used to initialize the CSpider.

addPage

addPage(aUrl) is a private method called by onLoadPage() to add the URL aUrl to the list of pending URLs to be visited. addPage limits the pages to be visited by rejecting any URLs which

loadPage

loadPage() initiates the process of loading the next page from the list maintained in mPagesPending. If the spider is not in the 'running' state, loadPage does nothing.

loadPage removes the next CUrl object from the mPagesPending stack and calls stop() if there are no more pages pending. mOnBeforePage() is called to control execution of the spider by returning true to begin the load or false to prevent the load and to enter the 'paused' state. If mOnBeforePage prevents the load, the CUrl object is placed back on the mPagesPending stack.

An instance of a CCallWrapper for onLoadPageTimeout is created, saved to mCallWrapperOnLoadPageTimeout and asynchronously executed to detect page timeouts and the mPageLoader's load() method is called to initiate the loading of the next page.
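The timeout guard can be approximated with standard timers, as in the sketch below. startLoadTimeout is a hypothetical stand-in for the role CCallWrapper plays and does not reflect CCallWrapper's real API.

```javascript
// Hypothetical stand-in for the role CCallWrapper plays; not its real API.
function startLoadTimeout(aOnTimeout, aIntervalMs) {
  let timerId = setTimeout(function () {
    timerId = null; // the guard has fired and is no longer pending
    aOnTimeout();
  }, aIntervalMs);

  return {
    isPending() { return timerId !== null; },
    cancel() {
      // Called when the page loads in time, so the timeout never fires.
      if (timerId !== null) {
        clearTimeout(timerId);
        timerId = null;
      }
    }
  };
}
```

A load handler would call cancel() on the returned guard as its first step, which mirrors how onLoadPage cancels mCallWrapperOnLoadPageTimeout.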

onLoadPageTimeout

If the mCallWrapperOnLoadPageTimeout has not been cancelled by onLoadPage, onLoadPageTimeout will execute.

If the spider is in the 'pausing' state, pause() is called and the spider enters the 'paused' state. If the spider is in the 'stopping' state, stop() is called and the spider enters the 'stopped' state.

The user defined mOnPageTimeout() is called to control the execution of the spider. If mOnPageTimeout returns true the spider will enter the 'timeout' state, otherwise the spider enters the 'running' state and loadPage will be called to continue spidering the site.

onLoadPage

onLoadPage() is called by the implementation of the page loader when each page has completed loading. onLoadPage will cancel mCallWrapperOnLoadPageTimeout.

If the spider is in the 'pausing' state, pause() is called and the spider enters the 'paused' state. If the spider is in the 'stopping' state, stop() is called and the spider enters the 'stopped' state.

onLoadPage adds the current mCurrentUrl to the mPagesVisited array, saves a reference to the loaded document in mDocument and calls addPage for each link, frame and iframe in the document.

onLoadPage calls the user specified mOnAfterPage to control the execution of the spider. If mOnAfterPage returns true, onLoadPage will call loadPage to begin loading the next page. If mOnAfterPage returns false, the spider will enter the 'paused' state.
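Gathering the candidate URLs from a loaded document might look like the following sketch, which uses the standard DOM members document.links and getElementsByTagName. collectUrls is a hypothetical name; how CSpider itself walks the document is not shown in this text.

```javascript
// Hypothetical helper: gather link, frame and iframe URLs from a document.
function collectUrls(aDocument) {
  const urls = [];
  // document.links holds the <a> and <area> elements that have an href.
  for (const link of Array.from(aDocument.links)) {
    if (link.href) {
      urls.push(link.href);
    }
  }
  for (const tagName of ['frame', 'iframe']) {
    for (const frame of Array.from(aDocument.getElementsByTagName(tagName))) {
      if (frame.src) {
        urls.push(frame.src);
      }
    }
  }
  return urls;
}
```

Each returned URL would then be handed to addPage, which decides whether it joins the pending list.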

cancelLoadPage

cancelLoadPage() calls mCallWrapperOnLoadPageTimeout.cancel() to cancel any pending page load timeouts and mPageLoader.cancel() to cancel any pages being currently loaded by the page loader and puts the current URL back on the mPagesPending stack.

Overridable Methods

The following methods are intended to be customized in order to provide extended functionality to the spider.

mOnStart

mOnStart() is called by run(). Return true to begin loading pages, false to re-enter the 'ready' state.

mOnBeforePage

mOnBeforePage() is called by loadPage(). Return true to continue loading the page, false to enter the 'paused' state. Note that when mOnBeforePage is called, mCurrentUrl will contain the CUrl object for the next page to be loaded.

mOnAfterPage

mOnAfterPage() is called by onLoadPage after all new links have been added to mPagesPending. Return true to load the next page or false to enter the 'paused' state. Note that when mOnAfterPage is called, mDocument will contain a reference to the currently loaded document.

mOnPageTimeout

mOnPageTimeout() will be called if the page does not complete loading in mOnLoadTimeoutInterval milliseconds. Return true to enter the 'timeout' state, false to attempt to load the next page.

mOnStop

mOnStop() is called by stop. Return true to enter the 'stopped' state, false to attempt to load the next page.

mOnPause

mOnPause() is called by pause. Return true to enter the 'paused' state, false to attempt to load the next page.

mOnRestart

mOnRestart() is called by restart. Return true to load the next page, false to re-enter the 'paused' state.
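Customizing the hooks amounts to assigning functions that return true or false, as in the sketch below. Here spider is a plain stand-in object rather than a real CSpider instance, the log array is only for illustration, and the assumption that hooks are invoked with this bound to the spider is not confirmed by this text.

```javascript
// 'spider' is a plain stand-in object; with the real class it would be
// a CSpider instance.
const spider = {
  mCurrentUrl: { mUrl: 'http://example.com/', mDepth: 0 },
  mDocument: null
};
const log = [];

spider.mOnStart = function () { log.push('start'); return true; };
spider.mOnBeforePage = function () {
  log.push('loading ' + this.mCurrentUrl.mUrl);
  return true; // go ahead and load the page
};
spider.mOnAfterPage = function () {
  log.push('loaded ' + this.mCurrentUrl.mUrl);
  return true; // continue with the next page
};
spider.mOnPageTimeout = function () {
  log.push('timed out ' + this.mCurrentUrl.mUrl);
  return false; // do not enter 'timeout'; try the next page
};
spider.mOnStop = function () { log.push('stop'); return true; };
```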