call back on cheerio node.js

无人久伴 提交于 2019-12-11 18:36:36

问题


I'm trying to write a scrapper using 'request' and 'cheerio'. I have an array of 100 urls. I'm looping over the array and using 'request' on each url and then doing cheerio.load(body). If I increase i above 3 (i.e. change it to i < 3 for testing) the scraper breaks because var productNumber is undefined and I can't call split on undefined variable. I think that the for loop is moving on before the webpage responds and has time to load the body with cheerio, and this question: nodeJS - Using a callback function with Cheerio would seem to agree.

My problem is that I don't understand how I can make sure the webpage has 'loaded' or been parsed in each iteration of the loop so that I don't get any undefined variables. According to the other answer I don't need a callback, but then how do I do it?

for (var i = 0; i < productLinks.length; i++) {
    productUrl = productLinks[i];
    request(productUrl, function(err, resp, body) {
        if (err)
            throw err;
        $ = cheerio.load(body);
        var imageUrl = $("#bigImage").attr('src'),
            productNumber = $("#product").attr('class').split(/\s+/)[3].split("_")[1]
        console.log(productNumber);

    });
};

Example of output:

1461536
1499543

TypeError: Cannot call method 'split' of undefined

回答1:


You are scraping some external site(s). You can't be sure the HTML all fits exactly the same structure, so you need to be defensive on how you traverse it.

var product = $('#product');
if (!product) return console.log('Cannot find a product element');
var productClass = product.attr('class');
if (!productClass) return console.log('Product element does not have a class defined');
var productNumber = productClass.split(/\s+/)[3].split("_")[1];
console.log(productNumber);

This'll help you debug where things are going wrong, and perhaps indicate that you can't scrape your dataset as easily as you'd hoped.




回答2:


Since you're not creating a new $ variable for each iteration, it's being overwritten when a request is completed. This can lead to undefined behaviour, where one iteration of the loop is using $ just as it's being overwritten by another iteration.

So try creating a new variable:

var $ = cheerio.load(body);
^^^ this is the important part

Also, you are correct in assuming that the loop continues before the request is completed (in your situation, it isn't cheerio.load that is asynchronous, but request is). That's how asynchronous I/O works.

To coordinate asynchronous operations you can use, for instance, the async module; in this case, async.eachSeries might be useful.



来源:https://stackoverflow.com/questions/19072172/call-back-on-cheerio-node-js

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!