PhantomJS open() too slow

滥情空心 2021-01-01 04:37

I'm having a problem with web scraping in Node.js: I want to take some data from a remote webpage, but the data is inserted into the HTML by JavaScript. I started to use

1 Answer
  • 2021-01-01 05:27

    There are several measures you can take to decrease processing time.

    1. Get a more powerful server/computer (as Mathieu rightly noted).

    Yes, you could argue this is irrelevant to the question, but in matters of scraping it very much is. On a budget $8 VPS, your initial script ran in 9589ms without any optimization, which is already a ~30% improvement over the time you reported.

    2. Turn off image loading. It helps a bit: 8160ms load time.

    page.settings.loadImages = false;  
    

    3. Analyze the page, then find and cancel unnecessary network requests.

    Even in a regular browser such as Google Chrome the site loads slowly: 129 requests and an 8.79s load time with Adblock Plus enabled. There are a lot of requests (GIF, 1 MB), and many of them go to third-party sites such as Facebook and Twitter (to fetch widgets) and to ad networks.

    We can cancel them too:

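    // Substrings of third-party URLs (widgets, ads, trackers) that the scrape does not need.
    var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];
    
    page.onResourceRequested = function(requestData, request) {
        // Indexed loop: for...in over an array leaks a global and also visits inherited keys.
        for (var i = 0; i < block_urls.length; i++) {
            if (requestData.url.indexOf(block_urls[i]) !== -1) {
                request.abort();
                console.log(requestData.url + " aborted");
                return;
            }
        }
    };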
    

    The load time for me is now just 4393ms, and the page is still loaded and usable [PhantomJS screenshot].

    I don't think much more can be done without tinkering with the page's code, because judging by the page source it is quite script-heavy.
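
    To put the measurements in perspective, the relative savings can be worked out directly. A small sketch that uses only the timings quoted in this answer; the helper name `reduction` is made up for illustration, and it assumes the 4393ms figure was measured with images off as well, as in the full script:

```javascript
// Load times quoted in this answer, in milliseconds.
var timings = {
  baseline: 9589,      // unoptimized script on the VPS
  noImages: 8160,      // page.settings.loadImages = false
  blockedHosts: 4393   // assumed: images off plus third-party requests aborted
};

// Percentage of time saved going from `before` to `after`, rounded to one decimal.
function reduction(before, after) {
  return Math.round((before - after) / before * 1000) / 10;
}

console.log('images off: ' + reduction(timings.baseline, timings.noImages) + '%');
console.log('images off + blocking: ' + reduction(timings.baseline, timings.blockedHosts) + '%');
```

    Disabling images alone saves roughly 15%, while aborting the third-party requests brings the total saving to a little over half.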
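
    One caveat with the substring check in step 3: `indexOf` matches anywhere in the URL, so an entry like 'twitter.com' would also abort a first-party request that merely mentions it in a query string. Below is a hedged sketch of a stricter variant that compares hostnames instead. `hostOf` and `isBlocked` are hypothetical helper names, not part of PhantomJS, and a regex is used on the assumption that the `URL` constructor may not be available in PhantomJS's older WebKit:

```javascript
var blockedHosts = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com',
                    'facebook.net', 'facebook.com', 'planplus.rs'];

// Extract the lowercase hostname from an absolute http(s) URL, or '' if there is none.
// Simplification: URLs with userinfo (user:pass@host) are not handled.
function hostOf(url) {
  var m = /^https?:\/\/([^\/?#:]+)/i.exec(url);
  return m ? m[1].toLowerCase() : '';
}

// True when the URL's host equals a blocked domain or is a subdomain of it.
function isBlocked(url, hosts) {
  var host = hostOf(url);
  for (var i = 0; i < hosts.length; i++) {
    var h = hosts[i];
    if (host === h || host.slice(-(h.length + 1)) === '.' + h) {
      return true;
    }
  }
  return false;
}

// Wiring into PhantomJS; skipped when no `page` object exists (e.g. under Node).
if (typeof page !== 'undefined') {
  page.onResourceRequested = function (requestData, request) {
    if (isBlocked(requestData.url, blockedHosts)) {
      request.abort();
    }
  };
}
```

    With this check, 'https://platform.twitter.com/widgets.js' is aborted, but a first-party URL ending in '?ref=twitter.com' is not.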

    The whole code:

    var page = require('webpage').create();
    var fs = require("fs");
    
    // console.time polyfill from https://github.com/callmehiphop/console-time
    ;(function( console ) {
      var timers;
      if ( !console ) {
        return;
      }
      timers = {};
      console.time = function( name ) {
        if ( name ) {
          timers[ name ] = Date.now();
        }
      };
      console.timeEnd = function( name ) {
        if ( timers[ name ] ) {
          console.log( name + ': ' + (Date.now() - timers[ name ]) + 'ms' );
          delete timers[ name ];
        }
      };
    }( window.console ));
    
    console.time("open");
    
    page.settings.loadImages = false;
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36';
    page.viewportSize = {
      width: 1280,
      height: 800
    };
    
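    // Substrings of third-party URLs (widgets, ads, trackers) that the scrape does not need.
    var block_urls = ['gstatic.com', 'adocean.pl', 'gemius.pl', 'twitter.com', 'facebook.net', 'facebook.com', 'planplus.rs'];
    page.onResourceRequested = function(requestData, request) {
        // Indexed loop: for...in over an array leaks a global and also visits inherited keys.
        for (var i = 0; i < block_urls.length; i++) {
            if (requestData.url.indexOf(block_urls[i]) !== -1) {
                request.abort();
                console.log(requestData.url + " aborted");
                return;
            }
        }
    };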
    
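    page.open('https://www.halooglasi.com/nekretnine/izdavanje-stanova/novi-beograd---novi-merkator-id19270/5425485514649', function (status) {
        // Stop early if the page could not be loaded at all.
        if (status !== 'success') {
            console.log('Failed to load the page: ' + status);
            phantom.exit(1);
            return;
        }
    
        // Save the HTML as it stands when the open callback fires.
        fs.write("longload.html", page.content, 'w');
    
        console.timeEnd("open");
    
        // Give the page's scripts a few seconds to finish before taking the screenshot.
        setTimeout(function(){
            page.render('longload.png');
            phantom.exit();
        }, 3000);
    
    });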
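
    The fixed 3000ms `setTimeout` before rendering is a guess at how long the page's scripts need. A common alternative is to poll until a condition holds. This is a minimal sketch under stated assumptions: `waitFor` and `renderWhenReady` are hypothetical helpers, and the selector passed in is a placeholder for whatever element the page's scripts insert; none of these come from the original script.

```javascript
// Poll `check` every `intervalMs` until it returns true or `timeoutMs` elapses.
// Calls onDone(true) on success and onDone(false) on timeout.
function waitFor(check, onDone, timeoutMs, intervalMs) {
  var start = Date.now();
  if (check()) {            // already satisfied, no timer needed
    onDone(true);
    return;
  }
  var timer = setInterval(function () {
    if (check()) {
      clearInterval(timer);
      onDone(true);
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onDone(false);
    }
  }, intervalMs);
}

// Intended to be called from inside the page.open callback (PhantomJS only).
// `selector` is a placeholder for the element that the page's scripts insert.
function renderWhenReady(page, selector) {
  waitFor(function () {
    // page.evaluate runs the function inside the page and returns its result.
    return page.evaluate(function (sel) {
      return document.querySelector(sel) !== null;
    }, selector);
  }, function (found) {
    console.log(found ? 'content ready' : 'timed out waiting for ' + selector);
    page.render('longload.png');
    phantom.exit();
  }, 10000, 250);
}
```

    Polling ends as soon as the content appears, so a fast load is not penalized by the full delay, while the timeout still guarantees that the script exits.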
    