问题
I'm trying to write a scrapper using 'request' and 'cheerio'. I have an array of 100 urls. I'm looping over the array and using 'request' on each url and then doing cheerio.load(body). If I increase i above 3 (i.e. change it to i < 3 for testing) the scraper breaks because var productNumber is undefined and I can't call split on undefined variable. I think that the for loop is moving on before the webpage responds and has time to load the body with cheerio, and this question: nodeJS - Using a callback function with Cheerio would seem to agree.
My problem is that I don't understand how I can make sure the webpage has 'loaded' or been parsed in each iteration of the loop so that I don't get any undefined variables. According to the other answer I don't need a callback, but then how do I do it?
for (var i = 0; i < productLinks.length; i++) {
productUrl = productLinks[i];
request(productUrl, function(err, resp, body) {
if (err)
throw err;
$ = cheerio.load(body);
var imageUrl = $("#bigImage").attr('src'),
productNumber = $("#product").attr('class').split(/\s+/)[3].split("_")[1]
console.log(productNumber);
});
};
Example of output:
1461536
1499543
TypeError: Cannot call method 'split' of undefined
回答1:
You are scraping some external site(s). You can't be sure the HTML all fits exactly the same structure, so you need to be defensive on how you traverse it.
var product = $('#product');
if (!product) return console.log('Cannot find a product element');
var productClass = product.attr('class');
if (!productClass) return console.log('Product element does not have a class defined');
var productNumber = productClass.split(/\s+/)[3].split("_")[1];
console.log(productNumber);
This'll help you debug where things are going wrong, and perhaps indicate that you can't scrape your dataset as easily as you'd hoped.
回答2:
Since you're not creating a new $
variable for each iteration, it's being overwritten when a request is completed. This can lead to undefined behaviour, where one iteration of the loop is using $
just as it's being overwritten by another iteration.
So try creating a new variable:
var $ = cheerio.load(body);
^^^ this is the important part
Also, you are correct in assuming that the loop continues before the request is completed (in your situation, it isn't cheerio.load
that is asynchronous, but request
is). That's how asynchronous I/O works.
To coordinate asynchronous operations you can use, for instance, the async module; in this case, async.eachSeries might be useful.
来源:https://stackoverflow.com/questions/19072172/call-back-on-cheerio-node-js