CasperJS: iterating over a list of links using casper.each

星月不相逢 2021-01-12 02:15

I am trying to use CasperJS to get a list of links from a page, open each of those links, and collect a particular type of data from each page into an array stored in an object.

3 answers
  • 2021-01-12 02:55

    If I understand your problem correctly, the fix is to declare items[] outside the loop so that it has a wider scope. In your code, I would have done the following:

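    var items = []; // declared outside the loop so every step callback can reach it
    this.each(listOfLinks, function(self, link) {

        var eachPageHref = link.href;

        console.log("Creating new array in object for " + eachPageHref);

        object[date][eachPageHref] = []; // array for page to store names

        self.thenOpen(eachPageHref, function () {

            // evaluate() runs inside the page sandbox and cannot see `items`,
            // so return the data and push it from the CasperJS side instead
            var found = this.evaluate(function() {
                // Perform DOM manipulation to get items
                return whateverThisItemIs; // placeholder, must be JSON-serializable
            });
            items.push(found);
        });
    });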
    

    Hope this helps.

  • 2021-01-12 03:05

    You are returning DOM nodes in the evaluate() function, which is not allowed. You can return the actual URLs instead.

    Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

    Closures, functions, DOM nodes, etc. will not work!

    Reference: PhantomJS#evaluate
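
    The rule of thumb quoted above can be tried in plain JavaScript. This is a minimal sketch, assuming a JSON round-trip is a fair approximation of the sandbox boundary; the crossBoundary helper and the example.com URLs are hypothetical and are not part of PhantomJS or CasperJS.

```javascript
// Hypothetical helper: approximates the page-to-script boundary with a JSON
// round-trip. This only illustrates the "rule of thumb" from the quote; it is
// not the actual PhantomJS serialization code.
function crossBoundary(value) {
    var json = JSON.stringify(value);
    return json === undefined ? undefined : JSON.parse(json);
}

// Plain strings inside an array survive the trip unchanged.
var urls = crossBoundary(['http://example.com/a', 'http://example.com/b']);

// A function cannot be represented in JSON, so nothing comes back.
var lostFunction = crossBoundary(function () { return 1; });

// Function-valued properties are silently dropped from objects.
var stripped = crossBoundary({
    href: 'http://example.com/a',
    onClick: function () {}
});

console.log(urls, lostFunction, stripped);
```

    In this sketch the array of URL strings comes back intact, while the function and the function-valued property are lost. That is the reason for returning link.href values rather than the link elements themselves.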

  • 2021-01-12 03:06

    I decided to use Stack Overflow itself as a demo site to run your script against. I corrected a few minor things in your code, and the result is this exercise in getting comments from PhantomJS bounty questions.

    var casper = require('casper').create();
    
    casper
    .start()
    .open('http://stackoverflow.com/questions/tagged/phantomjs?sort=featured&pageSize=30')
    .then(function () {
    
        var date = Date.now(), object = {};
        object[date] = {};
    
        var listOfLinks = this.evaluate(function(){
    
            // Getting links to other pages to scrape, this will be 
            // a primitive array that will be easily returned from page.evaluate
            var links = [].map.call(document.querySelectorAll("#questions .question-hyperlink"), function(link) {
              return link.href;
            });    
            return links;
        });
    
        // Now to iterate over that array of links
        this.each(listOfLinks, function(self, eachPageHref) {
    
            object[date][eachPageHref] = []; // array for page to store names
    
            self.thenOpen(eachPageHref, function () {
    
                // Getting comments from each page, also as an array
                var listOfItems = this.evaluate(function() {
                    var items = [].map.call(document.getElementsByClassName("comment-text"), function(comment) {
                        return comment.innerText;
                    });    
                    return items;
                });
                object[date][eachPageHref] = listOfItems;
            });
        });
    
        // After every link has been scraped, output the resulting object
        this.then(function(){
            console.log(JSON.stringify(object));
        });
    });
    
    casper.run();
    

    What changed: evaluate() now returns plain arrays of strings, which casper.each() needs in order to iterate correctly. The href attributes are extracted right away inside evaluate(). Also this correction:

     object[date][eachPageHref] = listOfItems; // previously assigned `items`, which was undefined in this scope
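
    The [].map.call(...) idiom that turns a collection of elements into a plain array of strings can also be tried outside CasperJS. This is a minimal sketch, assuming a plain array-like object as a stand-in for the NodeList that document.querySelectorAll would return; the fakeNodeList object and its example.com URLs are made up for illustration.

```javascript
// Stand-in for a NodeList: an array-like object with numeric keys and a
// length property. The href values are hypothetical.
var fakeNodeList = {
    0: { href: 'http://example.com/q/1' },
    1: { href: 'http://example.com/q/2' },
    length: 2
};

// Borrow Array.prototype.map so it can walk the array-like object and
// produce a real array that holds only strings.
var links = [].map.call(fakeNodeList, function (link) {
    return link.href;
});

console.log(Array.isArray(links), JSON.stringify(links));
```

    In the CasperJS script the same call runs inside evaluate(), and the resulting array of strings is what gets returned to the CasperJS side.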
    

    Running the CasperJS script produces:

    {"1478596579898":{"http://stackoverflow.com/questions/40410927/phantomjs-from-node-on-windows":["en.wikipedia.org/wiki/File_URI_scheme – Igor 2 days ago\n","@Igor is there something in particular you see wrong, or are you suggesting the phantom module has an incorrect URI? – Danny Buonocore 2 days ago\n","Probably windows security issue not allowing to run an unsigned program. – Vaviloff yesterday\n"],"http://stackoverflow.com/questions/40412726/casperjs-iterating-over-a-list-of-links-using-casper-each":["Thanks, this looked really promising. I made the changes but it didn't solve the problem. And I just realised that in debug mode the following happens: Creating new array object for https://example.com [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true and then Casperjs silently fails. It seems that the correct link that gets passed into thenOpen gets changed to about:blank... – cyc665 yesterday\n"]}}
    