There is a site I am trying to scrape that first loads HTML/JS, modifies the form input fields using JS, and then POSTs. How can I get the final HTML output of the POSTed page?
One approach that comes to mind, besides using a headless browser, is to simulate the AJAX calls and assemble the post-processed page yourself, request by request. This is often tricky, though, and should be a last resort unless you really like digging through JavaScript code.
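To make the request-by-request idea a bit more concrete, here is a minimal sketch of one step: building the POST body yourself once you know what the page's JS does to the form fields. The helper names (`encodeForm`, `buildPostBody`) and the field names (`user`, `token`) are hypothetical; the real ones have to be read off the site's form and scripts.

```javascript
// Encode a plain object as application/x-www-form-urlencoded,
// the default encoding a browser uses for a form POST.
function encodeForm(fields) {
    return Object.keys(fields).map(function (name) {
        return encodeURIComponent(name).replace(/%20/g, '+') + '=' +
               encodeURIComponent(fields[name]).replace(/%20/g, '+');
    }).join('&');
}

// Merge the values found in the static HTML with the values the
// page's JS would have written into the inputs before submitting.
function buildPostBody(htmlFields, jsOverrides) {
    var merged = {};
    Object.keys(htmlFields).forEach(function (k) { merged[k] = htmlFields[k]; });
    Object.keys(jsOverrides).forEach(function (k) { merged[k] = jsOverrides[k]; });
    return encodeForm(merged);
}

// Hypothetical field names, for illustration only.
var body = buildPostBody(
    { user: 'alice', token: '' },   // as found in the served HTML
    { token: 'abc 123' }            // as the page's JS would set it
);
console.log(body);
```

The resulting string is what you would send as the request body, with a Content-Type of application/x-www-form-urlencoded, using whatever HTTP client you prefer.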
When I copied your code directly and changed the URL to www.google.com, it worked fine, with two files saved.
Bear in mind that the files will be written to the location you run the script from, not to where your .js file is located.
The output code you have is correct, but there is a timing issue: the output lines are executed before the page has finished loading. You can hook into the onLoadFinished callback to find out when that happens. See the full code below.
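var page = require('webpage').create();
var fs = require('fs');

// Runs once the page has finished loading, so the output is complete.
page.onLoadFinished = function(status) {
    console.log("page load finished: " + status);
    page.render('export.png');              // screenshot
    fs.write('1.html', page.content, 'w');  // final HTML
    phantom.exit();
};

page.open("http://www.google.com", function() {
    page.evaluate(function() {
        // code to run in the page context goes here
    });
});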
When using a site like Google, it can be deceiving: it loads so quickly that you can often execute a screen grab inline, the way you have it. Timing is a tricky thing in PhantomJS; sometimes I test with setTimeout to see whether timing is the issue.
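If a fixed setTimeout turns out to be too fragile, a common alternative is to poll for a condition until it holds or a deadline passes. Below is a minimal sketch of such a helper. `waitFor` and its parameters are names made up for illustration, not part of the PhantomJS API, and the "ready" condition here is just a stand-in timer. It only relies on setTimeout and Date, so it should behave the same in PhantomJS or Node.

```javascript
// Poll check() every intervalMs until it returns true, then call onReady().
// If timeoutMs elapses first, call onTimeout() instead.
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    (function poll() {
        if (check()) {
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            onTimeout();
        } else {
            setTimeout(poll, intervalMs);
        }
    })();
}

// Stand-in for "the page is ready": becomes true after about 200 ms.
var readyAt = Date.now() + 200;
waitFor(
    function () { return Date.now() >= readyAt; },
    function () { console.log('ready, safe to render and save'); },
    function () { console.log('gave up waiting'); },
    2000,  // total timeout in ms
    50     // polling interval in ms
);
```

In a PhantomJS script, check would typically wrap a page.evaluate call that reports whether some expected element exists yet, and onReady would do the page.render and fs.write calls.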
I'm using CasperJS to run tests with PhantomJS. I added this code to my tearDown function:
var require = patchRequire(require);
var fs = require('fs');

casper.test.begin("My Test", {
    tearDown: function(){
        casper.capture("export.png");
        fs.write("1.html", casper.getHTML(undefined, true), 'w');
    },
    test: function(test){
        // test code
        casper.run(function(){
            test.done();
        });
    }
});
See the CasperJS docs for capture and getHTML.
This can easily be done with some PHP and JavaScript. Use this to capture the generated source in the browser: var generatedSource = new XMLSerializer().serializeToString(document); Then send it to a PHP script and save it there with fopen() and fwrite().
I tried several approaches to a similar task, and I got the best results with Selenium.
Before that I tried PhantomJS and Cheerio. PhantomJS crashed too often while executing JS on the page.