CSpider - a JavaScript Class for Spidering Web pages
CSpider is a JavaScript class intended to be used as a framework upon which web spidering applications can be built. Through its user-overridable methods, CSpider's behavior can be customized to meet the needs of many types of applications.
CSpider uses a helper class to abstract and manage requests to load external documents. The helper class must implement the following methods:
Constructs an instance of CSpider which will begin spidering at the location aUrl, optionally restricting visited pages to those whose URLs include the string aDomain (if aRestrictUrl is true), up to a depth of aDepth, while waiting up to aOnLoadTimeoutInterval seconds for each page to load before timing out. aPageLoader is a special helper object used to manage loading external pages. aExtraPrivilges can be set to true in Gecko in order to use netscape.security.PrivilegeManager to request the additional security privileges needed for cross-domain access to a loaded page's DOM.
If aRespectRobotRules is true, CSpider will load robots.txt from each domain it visits using isRobotBlocked.js and will not add pages to its pending list if they are blocked by the robot rules.
aUserAgent is currently passed to isRobotBlocked(), but it is not (yet) used to set the user agent of the client. If not passed as an argument, aUserAgent is assumed to be 'Gecko/'.
String containing the initial URL where CSpider will begin. If the initial value obtained from aUrl in the constructor does not begin with 'http://' or 'https://', 'http://' will be prepended to aUrl before it is saved in mUrl.
String which will be used (if mRestrictUrl is true) to filter links in loaded documents. If aDomain is not specified, aDomain will be set to the value of aUrl with any leading protocol ('http://' or 'https://') removed.
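The defaults described for mUrl and mDomain can be sketched as a small helper. This is only an illustration of the stated rules, not CSpider's actual code; the function name normalizeStart and the shape of its return value are assumptions made for the example.

```javascript
// Hypothetical helper (not part of CSpider) mirroring the documented defaults.
function normalizeStart(aUrl, aDomain) {
  // Prepend 'http://' unless the URL already starts with http:// or https://.
  var hasProtocol = /^https?:\/\//.test(aUrl);
  var url = hasProtocol ? aUrl : 'http://' + aUrl;

  // If no domain is given, fall back to the URL with its protocol removed.
  var domain = aDomain || url.replace(/^https?:\/\//, '');

  return { mUrl: url, mDomain: domain };
}

var start = normalizeStart('example.com/docs');
console.log(start.mUrl + ' ' + start.mDomain);
```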
Boolean which if true will cause CSpider to only follow links which contain the String mDomain.
Number indicating the depth (number of clicks from the initial URL) to which CSpider will follow links.
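Taken together, mRestrictUrl, mDomain and mDepth act as a filter on candidate links. The predicate below is a rough sketch of that filter under two assumptions: the name mayFollow is invented for the example, and the depth limit is treated as inclusive, which the description above does not specify. CSpider itself may apply additional rules.

```javascript
// Hypothetical predicate (not part of CSpider): may this link be followed?
function mayFollow(spider, url, depth) {
  // Assumption: a link at exactly mDepth is still allowed.
  if (depth > spider.mDepth) {
    return false;
  }
  // When restriction is on, the URL must contain the mDomain string.
  if (spider.mRestrictUrl && url.indexOf(spider.mDomain) === -1) {
    return false;
  }
  return true;
}

var settings = { mDepth: 2, mRestrictUrl: true, mDomain: 'example.com' };
console.log(mayFollow(settings, 'http://example.com/a', 1));
console.log(mayFollow(settings, 'http://other.org/', 1));
```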
An instance of CCallWrapper which is used to detect timeout conditions when loading pages.
Number (in milliseconds) that CSpider will wait for a page to load before calling mCallWrapperOnLoadPageTimeout.
Boolean which invokes netscape.security.PrivilegeManager.enablePrivilege to enable cross-domain spidering.
Boolean which determines if CSpider will load a site's robots.txt file and check the URLs of pages to be loaded against it using isRobotBlocked().
String which is passed to isRobotBlocked() to check the robot rules. It currently defaults to 'Gecko/'.
An Array of Strings of the URLs visited by CSpider.
An Array of CUrl instances representing the pages to be visited.
A hash of URL strings which are used to determine pages which have already been visited and which do not need to be revisited.
A String indicating the state of the spider. One of 'ready', 'running', 'pausing', 'paused', 'timeout', 'stopping', 'stopped'.
An instance of CUrl representing the page currently being loaded. CUrl is an internal class with two properties: mDepth, the depth of the URL, and mUrl, a String containing the actual URL.
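The relationship between the pending list, the hash of known URLs and CUrl can be pictured with the sketch below. It is an assumption-laden illustration: the CUrl constructor signature, the helper enqueue and the property name seen for the hash are invented here, since the text above does not give them.

```javascript
// Stand-in for the internal CUrl class: a URL string plus its depth.
function CUrl(aUrl, aDepth) {
  this.mUrl = aUrl;
  this.mDepth = aDepth;
}

// Hypothetical helper: push a URL onto the pending stack only the first time it is seen.
function enqueue(state, url, depth) {
  if (url in state.seen) {
    return false; // already known, no need to revisit
  }
  state.seen[url] = true;
  state.mPagesPending.push(new CUrl(url, depth));
  return true;
}

// Object.create(null) gives a hash without inherited keys such as 'toString'.
var state = { mPagesPending: [], seen: Object.create(null) };
enqueue(state, 'http://example.com/', 0);
enqueue(state, 'http://example.com/', 1); // ignored as a duplicate
console.log(state.mPagesPending.length);
```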
A reference to the currently loaded Document. This property is set to null in loadPage() and set to the currently loaded document by onLoadPage().
These methods are used to control the spider.
run() begins spidering at mUrl. run() enters the 'running' state, then calls the user specified mOnStart() before loading any pages. mOnStart controls the execution of the spider by returning true to begin loading pages or false to prevent loading pages and re-enter the 'ready' state.
If the spider is in either the 'paused' or 'timeout' states, restart() will enter the 'running' state, then call the user specified mOnRestart() before loading any pages. mOnRestart controls the execution of the spider by returning true to begin loading pages or false to prevent loading pages and enter the 'paused' state.
If mCallWrapperOnLoadPageTimeout is not null, pause() enters the 'pausing' state and then returns to allow either onLoadPage or onLoadPageTimeout to complete before entering the 'paused' state. If mCallWrapperOnLoadPageTimeout is null, pause() enters the 'paused' state then calls mOnPause() to control the execution of the spider by returning true to stay in the paused state or false to enter the 'running' state and call loadPage() to continue spidering.
If mCallWrapperOnLoadPageTimeout is not null, stop() enters the 'stopping' state and then returns to allow either onLoadPage or onLoadPageTimeout to complete before entering the 'stopped' state. If mCallWrapperOnLoadPageTimeout is null, stop() enters the 'stopped' state, then calls mOnStop() to control the execution of the spider by returning true to stay in the stopped state or false to enter the 'running' state and call loadPage() to continue spidering.
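The state handling of run() and pause() described above can be sketched with a small stand-in class. MiniSpider is not CSpider: the property name mState, the loads counter and the trivial loadPage() are assumptions added so the transitions can be followed in isolation. stop() would follow the same pattern as pause() with 'stopping', 'stopped' and mOnStop().

```javascript
// Minimal stand-in (not the real CSpider) showing only state transitions.
function MiniSpider() {
  this.mState = 'ready';                     // assumed name for the state property
  this.mCallWrapperOnLoadPageTimeout = null; // non-null while a page load is in flight
  this.loads = 0;                            // counts loadPage() calls, for illustration
  this.mOnStart = function () { return true; };
  this.mOnPause = function () { return true; };
}

MiniSpider.prototype.loadPage = function () {
  if (this.mState !== 'running') { return; } // does nothing unless running
  this.loads++;
};

MiniSpider.prototype.run = function () {
  this.mState = 'running';
  if (!this.mOnStart()) {
    this.mState = 'ready'; // the hook vetoed the start
    return;
  }
  this.loadPage();
};

MiniSpider.prototype.pause = function () {
  if (this.mCallWrapperOnLoadPageTimeout !== null) {
    this.mState = 'pausing'; // let the pending load or timeout finish first
    return;
  }
  this.mState = 'paused';
  if (!this.mOnPause()) {
    this.mState = 'running'; // the hook asked to keep spidering
    this.loadPage();
  }
};
```

In this sketch a hook returning false always pushes the spider back toward its previous activity, which matches the return-value convention used by the user-overridable methods.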
These methods are used internally by CSpider.
init() is a private method used to initialize the CSpider.
addPage(aUrl) is a private method called by onLoadPage() to add the URL aUrl to the list of pending URLs to be visited. addPage limits the pages to be visited by rejecting any URLs which
loadPage() initiates the process of loading the next page from the list maintained in mPagesPending. If the spider is not in the 'running' state, loadPage does nothing.
loadPage removes the next CUrl object from the mPagesPending stack and calls stop() if there are no more pages pending. mOnBeforePage() is called to control execution of the spider by returning true to begin the load or false to prevent the load and to enter the 'paused' state. If mOnBeforePage prevents the load, the CUrl object is placed back on the mPagesPending stack.
An instance of a CCallWrapper for onLoadPageTimeout is created, saved to mCallWrapperOnLoadPageTimeout and asynchronously executed to detect page timeouts and the mPageLoader's load() method is called to initiate the loading of the next page.
If the mCallWrapperOnLoadPageTimeout has not been cancelled by onLoadPage, onLoadPageTimeout will execute.
If the spider is in the 'pausing' state, pause() is called and the spider enters the 'paused' state. If the spider is in the 'stopping' state, stop() is called and the spider enters the 'stopped' state.
The user defined mOnPageTimeout() is called to control the execution of the spider. If mOnPageTimeout returns true the spider will enter the 'timeout' state, otherwise the spider enters the 'running' state and loadPage will be called to continue spidering the site.
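The timeout mechanism amounts to a deferred call that is cancelled if the page loads in time. The sketch below shows that pattern with a hypothetical TimeoutGuard class; it is not the CCallWrapper API, whose methods are not described here. The injectable timers argument is an assumption added to make the behavior easy to exercise without waiting.

```javascript
// Hypothetical stand-in for a call wrapper: run a callback later unless cancelled.
function TimeoutGuard(callback, delayMs, timers) {
  // Default to the global timer functions, wrapped so they are called unbound.
  this.timers = timers || {
    set: function (fn, ms) { return setTimeout(fn, ms); },
    clear: function (id) { clearTimeout(id); }
  };
  this.callback = callback;
  this.delayMs = delayMs;
  this.id = null; // null means no timeout is pending
}

TimeoutGuard.prototype.start = function () {
  var self = this;
  this.id = this.timers.set(function () {
    self.id = null;   // the timeout has fired, nothing left to cancel
    self.callback();
  }, this.delayMs);
};

TimeoutGuard.prototype.cancel = function () {
  if (this.id !== null) {
    this.timers.clear(this.id);
    this.id = null;
  }
};

// Usage: start the guard before a load, cancel it when the load completes.
var guard = new TimeoutGuard(function () { console.log('page timed out'); }, 5000);
guard.start();
guard.cancel(); // the page "loaded" in time, so nothing is printed
```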
onLoadPage() is called by the implementation of the page loader when each page has completed loading. onLoadPage will cancel mCallWrapperOnLoadPageTimeout.
If the spider is in the 'pausing' state, pause() is called and the spider enters the 'paused' state. If the spider is in the 'stopping' state, stop() is called and the spider enters the 'stopped' state.
onLoadPage adds the current mCurrentUrl to the mPagesVisited array, saves a reference to the loaded document in mDocument and calls addPage for each link, frame and iframe in the document.
onLoadPage calls the user specified mOnAfterPage to control the execution of the spider. If mOnAfterPage returns true, onLoadPage will call loadPage to begin loading the next page. If mOnAfterPage returns false, the spider will enter the 'paused' state.
cancelLoadPage() calls mCallWrapperOnLoadPageTimeout.cancel() to cancel any pending page load timeout, calls mPageLoader.cancel() to cancel any page currently being loaded by the page loader, and puts the current URL back on the mPagesPending stack.
The following methods are intended to be customized in order to provide extended functionality to the spider.
mOnStart() is called by run(). Return true to begin loading pages, false to re-enter the 'ready' state.
mOnBeforePage() is called by loadPage(). Return true to continue loading the page, false to enter the 'paused' state. Note that when mOnBeforePage is called, mCurrentUrl will contain the CUrl object for the next page to be loaded.
mOnAfterPage() is called by onLoadPage after all new links have been added to mPagesPending. Return true to load the next page or false to enter the 'paused' state. Note that when mOnAfterPage is called, mDocument will contain a reference to the currently loaded document.
mOnPageTimeout() will be called if the page does not complete loading in mOnLoadTimeoutInterval milliseconds. Return true to enter the 'timeout' state, false to attempt to load the next page.
mOnStop() is called by stop. Return true to enter the 'stopped' state, false to attempt to load the next page.
mOnPause() is called by pause. Return true to enter the 'paused' state, false to attempt to load the next page.
mOnRestart() is called by restart. Return true to load the next page, false to re-enter the 'paused' state.
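As an illustration of how these hooks might be overridden, the sketch below assigns two of them on a plain object that stands in for a CSpider instance. The object, the maxPages budget and the choice to reference the spider through a closure are all assumptions for the example. Only the hook names and their return-value meanings come from the descriptions above.

```javascript
// Plain object standing in for a CSpider instance (assumption for this sketch).
var spider = { mCurrentUrl: null, mPagesVisited: [] };
var maxPages = 2; // illustrative page budget, not a CSpider option

// Returning false from mOnBeforePage pauses the spider, so this
// pauses once the page budget has been used up.
spider.mOnBeforePage = function () {
  return spider.mPagesVisited.length < maxPages;
};

// Returning false from mOnPageTimeout skips the 'timeout' state
// and lets the spider attempt the next page.
spider.mOnPageTimeout = function () {
  return false;
};

console.log(spider.mOnBeforePage());
```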