Cheerio doesn't wait for body to load

只谈情不闲聊 submitted on 2019-12-10 12:23:21

Question


I made a very simple script that scrapes a recipe website to get the title, the preparation time and the ingredients. Everything works fine except that the script does not manage to scrape every page in my array. Sometimes I get 4 of them, sometimes 2, sometimes even 0...

It seems that the script doesn't wait for the body to be fully loaded. I'm fully aware that cheerio doesn't execute JavaScript on a website, but as far as I know the information I scrape isn't generated by any script; it is pure HTML.

How can I ask cheerio to wait 1 second when a page is visited, or simply to wait for the HTML to be fully loaded?

Here is my code (it runs, so you can try it) and an example of the output:

    var pools = [
        "http://www.marmiton.org/recettes/recette_salade-de-betteraves-a-l-orientale_16831.aspx",
        "http://www.marmiton.org/recettes/recette_pain-d-epices-a-la-dijonnaise_16832.aspx",
        "http://www.marmiton.org/recettes/recette_tarte-au-chocolat-et-creme-moka_16834.aspx",
        "http://www.marmiton.org/recettes/recette_poulet-a-la-gaston-gerard_16836.aspx",
        "http://www.marmiton.org/recettes/recette_assiette-paula_16837.aspx"
    ];

    var request = require("request");
    var cheerio = require("cheerio");
    var poolsLength = pools.length;

    for (var i = 0; i < pools.length; i++) {
        var url = pools[i];
        request(url, function (error, response, body) {
            if (!error) {
                var $ = cheerio.load(body, {
                    ignoreWhitespace: true
                });
                var name = [];
                var address = [];
                var website = [];

                $('body').each(function (i, elem) {
                    name = $(elem).find('.fn').text();
                    address = $(elem).find('.preptime').text();
                    website = $(elem).find('.m_content_recette_ingredients').text();
                    console.log(name + "±" + address + "±" + website);
                });
            }
        });
    }

As you can see above, it only worked for 2 of 5 pages.


Answer 1:


You can try the following code. The setTimeout calls stagger the requests so that one is sent every 10 seconds instead of all of them at once.

    var pools = [
        "http://www.marmiton.org/recettes/recette_salade-de-betteraves-a-l-orientale_16831.aspx",
        "http://www.marmiton.org/recettes/recette_pain-d-epices-a-la-dijonnaise_16832.aspx",
        "http://www.marmiton.org/recettes/recette_tarte-au-chocolat-et-creme-moka_16834.aspx",
        "http://www.marmiton.org/recettes/recette_poulet-a-la-gaston-gerard_16836.aspx",
        "http://www.marmiton.org/recettes/recette_assiette-paula_16837.aspx"
    ];

    var request = require("request");
    var cheerio = require("cheerio");
    var interval = 10 * 1000; // 10 seconds between requests

    for (var i = 0; i < pools.length; i++) {
        // setTimeout passes its third argument (i) to the callback, so each
        // timer looks up its own URL instead of sharing the last value of the loop.
        setTimeout(function (i) {
            request(pools[i], function (error, response, body) {
                if (!error) {
                    var $ = cheerio.load(body, {
                        ignoreWhitespace: true
                    });

                    $('body').each(function (j, elem) {
                        var name = $(elem).find('.fn').text();
                        var address = $(elem).find('.preptime').text();
                        var website = $(elem).find('.m_content_recette_ingredients').text();
                        console.log(name + "±" + address + "±" + website);
                    });
                }
            });
        }, interval * i, i);
    }
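A fixed delay is one option; another is to start each request only after the previous one has finished. The following is a minimal sketch of that idea and is not part of the original answer. The names scrapeSequentially, fetchPage and extract are hypothetical: fetchPage(url) is assumed to return a promise of the page's HTML, and extract(html) is assumed to pull out the fields you need (for example with cheerio). Failures are recorded per URL rather than silently skipped, which is what the `if (!error)` check above does.

```javascript
// Sketch only: fetchPage and extract are hypothetical helpers supplied by the caller.
async function scrapeSequentially(urls, fetchPage, extract) {
  const results = [];
  for (const url of urls) {
    try {
      // The next request is not started until this one has resolved.
      const html = await fetchPage(url);
      results.push({ url, data: extract(html) });
    } catch (err) {
      // Keep a record of the failure so missing pages are visible.
      results.push({ url, error: err.message });
    }
  }
  return results;
}
```

With this shape, a page that fails to download shows up in the results with an error message, which may help explain why only some of the pages are printed.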



Answer 2:


To scrape many pages, give each task a callback that it calls when it is done, then run the tasks with async.parallel from the async module.

My solution here:

http://paste.ubuntu.com/p/vfDnbjPw87/
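The pattern described above can be sketched roughly as follows. This is not the content of the linked paste. runParallel is a hand-written stand-in that mimics the `async.parallel(tasks, callback)` calling convention (each task receives a callback taking an error and a result), so the example runs without any dependency. makeTasks and fetchPage are hypothetical names; fetchPage(url, cb) is assumed to call cb(err, html).

```javascript
// Hand-written stand-in for async.parallel(tasks, done); illustrative only.
function runParallel(tasks, done) {
  const results = new Array(tasks.length);
  let pending = tasks.length;
  let failed = false;
  if (pending === 0) return done(null, results);

  tasks.forEach((task, index) => {
    task((err, result) => {
      if (failed) return;          // ignore callbacks that arrive after an error
      if (err) {
        failed = true;
        return done(err);
      }
      results[index] = result;     // keep results in task order, not arrival order
      pending -= 1;
      if (pending === 0) done(null, results);
    });
  });
}

// Hypothetical task factory: one task per URL, each with its own `url` binding.
function makeTasks(urls, fetchPage) {
  return urls.map((url) => (cb) => fetchPage(url, cb));
}
```

Because each task signals completion through its callback, the final callback only runs once every page has been fetched, and building the tasks with map gives each one its own URL.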



Source: https://stackoverflow.com/questions/44797467/cheerio-doesnt-wait-for-body-to-load
