Can I get the original page source (vs current DOM) with phantomjs/casperjs?

栀梦 2021-01-05 04:01

I am trying to get the original source for a particular web page.

The page executes some scripts that modify the DOM as soon as it loads. I would like to get the sou

3 answers
  • 2021-01-05 04:36

    Hmm, did you try using events? For example:

    casper.on('load.started', function(resource) {
        casper.echo(casper.getPageContent());
    });
    

    I don't think it will work, but try it anyway.

    The problem is that you can't do it in a normal CasperJS step, because by then the scripts on your page have already executed. It could work if we could bind to the DOM-ready event, or if Casper had a specific event for it. But the page must be loaded before Casper can send any JS to the DOM environment, so binding onready isn't possible (at least I don't see how). As far as I know, with Phantom we can only scrape data after the load event, i.e. once the page is rendered.

    So if it's not possible to hack it with the events and maybe some delay, your only solution is to block the scripts that modify your DOM.

    There is still the PhantomJS page setting for this; in Casper you set it like so:

    casper.options.pageSettings.javascriptEnabled = false;
    

    The problem is that you need JS enabled to get the data back, so it can't work... :p Yeah, useless comment! :)

    Otherwise you have to use events to block the resource/script that modifies the DOM.

    Or you could use the resource.received event to scrape the data you want before the specific resources that modify the DOM arrive.

    In fact, I don't think that's possible: if you create a step that grabs data from the page just before those resources arrive, they will already have loaded by the time your step executes. You would need to freeze the subsequent resources while your step scrapes the data.

    I don't know how to do that, but these events could help you:

    casper.on('resource.requested', function(request) {
        console.log(" request " + request.url);
    });
    
    casper.on('resource.received', function(resource) {
        console.log(resource.url);
    });
    
    casper.on('resource.error',function (request) {
        this.echo('[res : id and url + error description] <-- ' + request.id + ' ' + request.url + ' ' + request.errorString);
    });
    

    See also How do you Disable css in CasperJS?. The solution that would work is to identify the scripts and block them. But if you need them, well, I don't know; it's a good question. Maybe we could defer the execution of a specific script, but I don't think Casper and Phantom make that easy. The only useful option is abort(); a timeout option (time in ms) would be welcome!

    onResourceRequested

    Here is a similar question: Injecting script before other
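
    The "identify the scripts and block them" idea could be sketched roughly as follows. This is only a sketch under assumptions: shouldBlock and blockedScripts are hypothetical names, the URL patterns are made up, and the handler assumes CasperJS 1.1+, where resource.requested passes the request data and a network request object that has abort().

```javascript
// Hypothetical helper: decide whether a requested URL should be aborted.
// `patterns` is an array of RegExp objects that you choose yourself.
function shouldBlock(url, patterns) {
    return patterns.some(function (pattern) {
        return pattern.test(url);
    });
}

// Example patterns (assumptions): the scripts that rewrite the DOM.
var blockedScripts = [/\/dom-rewriter\.js(\?|$)/, /analytics/];

// Register the handler only when running under CasperJS.
// Assumes CasperJS 1.1+, where the handler receives (requestData, networkRequest).
if (typeof casper !== 'undefined') {
    casper.on('resource.requested', function (requestData, networkRequest) {
        if (shouldBlock(requestData.url, blockedScripts)) {
            this.echo('aborting ' + requestData.url);
            networkRequest.abort();
        }
    });
}
```

    This only helps with external scripts. Inline scripts in the page itself are not separate resources, so they would still run.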

  • 2021-01-05 04:38

    According to the docs, you can use debugHTML() to print the HTML of the current page (debugPage() prints its text content instead).

    casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');
    
    casper.start('http://www.xxxxxxx.xxx/login');
    
    casper.waitForSelector('input#login', ... );
    
    casper.then(function() {
      this.debugHTML();
    });
    
    casper.run();
    

    regards david

  • 2021-01-05 04:58

    As Fanch pointed out, it seems this isn't possible in a single request. If you can afford two requests, it gets easy: make one request with JavaScript disabled and one with it enabled, then scrape both versions of the page source and compare them.

    casper
        .then(function(){
            this.options.pageSettings.javascriptEnabled = false;
        })
        .thenOpen(url, function(){
            this.echo("before JavaScript");
            this.echo(this.getHTML());
        })
        .then(function(){
            this.options.pageSettings.javascriptEnabled = true;
        })
        .thenOpen(url, function(){
            this.echo("after JavaScript");
            this.echo(this.getHTML());
        });
    

    You can change the order according to your needs. If you're already on a page that you want to have the original markup of, then you can use casper.getCurrentUrl() to get the current URL:

    casper
        .then(function(){
            // submit or whatever
        })
        .thenOpen(url, function(){
            this.echo("after JavaScript");
            this.echo(this.getHTML());
            this.options.pageSettings.javascriptEnabled = false;
    
            this.thenOpen(this.getCurrentUrl(), function(){
                this.echo("before JavaScript");
                this.echo(this.getHTML());
            })
        });
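
    For the "compare" part, a crude line-based comparison of the two getHTML() results might look like the sketch below. diffLines and trimLine are hypothetical helpers, not part of CasperJS, and a line-based diff is only meaningful when the markup is spread over several lines.

```javascript
// Hypothetical helper: strip surrounding whitespace from one line.
function trimLine(line) {
    return line.trim();
}

// Hypothetical helper: list the non-empty lines that appear in only one capture.
function diffLines(beforeHtml, afterHtml) {
    var beforeLines = beforeHtml.split('\n').map(trimLine);
    var afterLines = afterHtml.split('\n').map(trimLine);
    return {
        // Lines in the original source that scripts removed or changed.
        removed: beforeLines.filter(function (line) {
            return line && afterLines.indexOf(line) === -1;
        }),
        // Lines that exist only after the scripts have run.
        added: afterLines.filter(function (line) {
            return line && beforeLines.indexOf(line) === -1;
        })
    };
}

// Made-up captures standing in for the two this.getHTML() results.
var beforeJs = '<div id="app">\n</div>';
var afterJs = '<div id="app">\n  <p>injected</p>\n</div>';
var changes = diffLines(beforeJs, afterJs);
console.log(JSON.stringify(changes));
```

    In a Casper script you would store the two getHTML() results in variables inside the thenOpen callbacks and pass them to diffLines in a later step.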
    