13 thoughts on “YASU 0.0.2.15”

  1. I’m not sure. I attempted to load forbes using the remote xul form: chrome://spider/content and appeared to reproduce your problem. It seemed that forbes didn’t complete sending the images for the page and it timed out.

    I tried again with the full extension version starting spider from the command line and did not reproduce the problem.

    Finally, I tried again with the remote xul form and it seemed to work fine.

    My guess is just that forbes sometimes doesn’t complete sending the data for a page and it causes the onload event handler to not fire.

    Try again and see if it is reproducible. If it is, look in the Error console and see if there is any indication of an internal error that may have stopped Spider.

    Are you using a userhook script? Maybe that failed?

    Thanks for the comment and for using Spider.

    /bc

  2. hmmm, if you are running in remote xul form do you see:

    Error: [Exception… “‘goQuitApplication: privilege failure TypeError: window.netscape is undefined’ when calling method: [nsIObserver::observe]” nsresult: “0x8057001e (NS_ERROR_XPC_JS_THREW_STRING)” location: “” data: no]

  3. I must admit that I’m not that familiar with this but it sounds like a really useful tool. So I will use this update to get familiar with it. So thanks for that hint.

    After enabling JavaScript errors and warnings, I now see that it fails to go beyond the first page only occasionally (roughly 10 percent of the time).

    Can the presence of div elements be the source of the failure? I am thinking this way because http://www.vanityfair.com fails to go beyond the home page every time. Their page is full of div elements.

    I do not see any errors thrown during the failure. The presence or absence of a hook function did not have any effect in my case.

    The add-on works fine on most of the sites when run in a linux environment. But when it spiders through some sites with video made for Windows (like streaming video from CBS), Firefox hangs, and I have no way of escaping from it.

    Thanks

  5. Just to clarify: when I said “The add-on works fine on most of the sites when run in a linux environment.” I meant sites other than forbes and vanityfair.

  6. Ok. I can confirm the behavior you are seeing. I don’t know what is up but it looks like a bug in Spider. I’ll try to check it out this evening and see if I can figure it out and fix it. Thanks for the report.

  7. Hello,

    I am trying to implement a user hook to use with the spider. When I try to retrieve the window using

    =====
    var aDocument = gSpider.mDocument;
    var aWindow = gSpider.mDocument.defaultView;
    var num = aWindow.frames.length;
    logn("number of frames " + num);
    =====

    the returned frame count does not include all the frames (even though they are visible in the spider window). That is, the number of frames returned is less than what the built-in pageInfo.js script reports. This is how pageInfo.js does it:
    ====
    if ("arguments" in window && window.arguments.length >= 1 &&
        window.arguments[0] && window.arguments[0].doc) {
      gDocument = window.arguments[0].doc;
      gWindow = gDocument.defaultView;
    }
    else {
      if ("gBrowser" in window.opener) {
        gWindow = window.opener.gBrowser.contentWindow;
      } else {
        gWindow = window.opener.frames[0];
      }
      gDocument = gWindow.document;
    }

    var num = gWindow.frames.length;
    logn("number of frames " + num);
    ====

    What am I doing wrong here?

    Thank you
    Mohamed

  8. Hello,

    Thanks for your response. Yes, you are right. It is a wrappedJSObject. But it appears I still do not get all the frames.

    What my hook is trying to do is download all the images in each page the spider visits. Since this is identical to what Firefox's pageInfo does, I am trying to follow the flow of pageInfo.js (I cannot think of a way of using pageInfo.js itself directly).

    I can see that my hook misses some embedded Flash images that pageInfo is able to capture. I did some debugging, and the difference I notice is that pageInfo.js gets more frames out of the window. The way pageInfo receives the window object is
    ====
    gWindow = window.opener.gBrowser.contentWindow;
    gDocument = gWindow.document;
    ====
    The way my hook obtains them is
    ====
    var aDocument = gSpider.mDocument;
    var aWindow = gSpider.mDocument.defaultView;
    if (aWindow.wrappedJSObject) {
      aWindow = aWindow.wrappedJSObject;
    }
    ====

    The processing after that is identical: go through all the frames and use a tree walker.

    To cite an example, if I use spider to visit http://home.live.com after signing in to Hotmail, the hook says there are two frames and it misses the embedded Flash image that appears in the bottom right corner. If I use pageInfo, it says there are four frames and it captures that Flash image.

  9. Hello,

    I added the following line in userOnAfterPage() to see what happens.

    window.open("chrome://browser/content/pageinfo/pageInfo.xul");

    Interestingly, the pageInfo window that popped up did not have the iframe-embedded image that I could see in the parent spider window! So it means what we get is not what we see.

    It happens on many sites including http://www.aol.com.

    Maybe it is happening for elements inside DIV elements?

  10. Hello,

    It seems like the document is missing some frames because at the time userOnAfterPage() is called not all the content has been loaded.

    This can be solved by increasing the delay in doGrab() or adding a delay in userOnAfterPage() before retrieving the document using gSpider.mDocument.

    Thank you for your help.
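
The workaround described in comment 10 (waiting before reading gSpider.mDocument) could be sketched roughly as follows. This is only an illustration, not code from Spider: countFrames and whenFramesSettle are hypothetical helper names, and the delay and retry values are arbitrary. Instead of one fixed delay, the sketch re-checks the frame count until it stops changing. It also counts nested frames, since window.frames.length only reports direct children.

```javascript
// Hypothetical helper, not part of Spider: counts a window's frames,
// including frames nested inside other frames.
function countFrames(win) {
  var total = win.frames.length;
  for (var i = 0; i < win.frames.length; i++) {
    total += countFrames(win.frames[i]);
  }
  return total;
}

// Hypothetical helper: re-checks every delayMs until the frame count is the
// same on two consecutive checks (or maxTries is reached), then calls
// onReady(win, count). getWindow is called on every check so that content
// loaded late is picked up. schedule defaults to setTimeout and is a
// parameter only so the timing can be replaced in tests.
function whenFramesSettle(getWindow, onReady, delayMs, maxTries, schedule) {
  schedule = schedule || setTimeout;
  var lastCount = -1;
  var tries = 0;

  function check() {
    var win = getWindow();
    var count = countFrames(win);
    tries++;
    if (count === lastCount || tries >= maxTries) {
      onReady(win, count);
      return;
    }
    lastCount = count;
    schedule(check, delayMs);
  }

  schedule(check, delayMs);
}

// Assumed usage inside a user hook; gSpider and logn come from the hook
// environment shown in the comments above, and 1000 ms / 5 tries are guesses.
function userOnAfterPage() {
  whenFramesSettle(
    function () { return gSpider.mDocument.defaultView; },
    function (win, count) { logn("number of frames " + count); },
    1000,
    5
  );
}
```

One caveat: this assumes the page stays loaded in Spider long enough for the checks to finish. If Spider moves on to the next page first, the delay in doGrab() mentioned in comment 10 would still need to be increased.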
