What are the best Java libraries to "fully download any webpage and render the built-in JavaScript(s) and then access the rendered webpage (that is the DOM-Tree !) programm
This is a bit outside the box, but if you are planning to run your code on a server where you have complete control over your environment, it might work...
Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.
Using the Firefox plugin system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.
From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect ) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.
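The Java side of that last step could look roughly like the sketch below. The names DomReceiver, receiveDom, and nextPage are hypothetical, and how the plugin reaches the class depends on your LiveConnect setup. The static method only queues the string so that the call from the browser returns quickly; the actual processing is assumed to happen on a separate worker thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Hypothetical receiving side for the browser plugin (illustrative names,
 * not part of LiveConnect). Declare the class public in its own file when
 * deploying; it is package-private here so the snippet compiles in any file.
 */
final class DomReceiver {

    /** Simple holder for one rendered page. */
    static final class RenderedPage {
        final String url;
        final String html;

        RenderedPage(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }

    // Thread-safe hand-off between the calling browser thread and a worker thread.
    private static final BlockingQueue<RenderedPage> PAGES = new LinkedBlockingQueue<>();

    private DomReceiver() {}

    /** Entry point for the plugin: accepts the serialized DOM and returns immediately. */
    public static boolean receiveDom(String url, String html) {
        if (url == null || html == null || html.isEmpty()) {
            return false; // nothing useful to process
        }
        return PAGES.offer(new RenderedPage(url, html));
    }

    /** Returns the next queued page, or null if none is waiting. */
    public static RenderedPage nextPage() {
        return PAGES.poll();
    }
}
```

A worker thread would call DomReceiver.nextPage() in a loop and hand each page to whatever parsing or extraction code you need.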
Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.
Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.
I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:
https://developer.mozilla.org/en/XQuery
You can use the JavaFX 2 WebEngine. Download the JavaFX SDK (you may already have it if you installed JDK7u2 or later) and try the code below.
It prints the HTML after the page's JavaScript has been processed. You can uncomment the two lines in the middle to see the rendering as well.
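import javafx.application.Application;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

public class WebLauncher extends Application {
    @Override
    public void start(Stage stage) {
        final WebView webView = new WebView();
        final WebEngine webEngine = webView.getEngine();
        webEngine.load("http://stackoverflow.com");
        // Uncomment the next two lines to show the rendered page in a window.
        //stage.setScene(new Scene(webView));
        //stage.show();
        webEngine.getLoadWorker().workDoneProperty().addListener(new ChangeListener<Number>() {
            @Override
            public void changed(ObservableValue<? extends Number> observable, Number oldValue, Number newValue) {
                if (newValue.intValue() == 100 /*percents*/) {
                    try {
                        // Loading has finished: grab the live DOM of the page.
                        org.w3c.dom.Document doc = webEngine.getDocument();
                        // Serialize it to stdout with the standard javax.xml.transform API.
                        Transformer transformer = TransformerFactory.newInstance().newTransformer();
                        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
                        transformer.transform(new DOMSource(doc), new StreamResult(System.out));
                    } catch (TransformerException ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });
    }

    public static void main(String[] args) {
        launch(args);
    }
}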
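Since webEngine.getDocument() returns a standard org.w3c.dom.Document, you are not limited to printing it; you can walk it with the regular DOM API. Here is a minimal sketch, assuming you want the target of every link on the page. DomQueries and linkTargets are illustrative names, not part of JavaFX.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative helper (hypothetical name) for reading data out of a rendered DOM. */
final class DomQueries {

    private DomQueries() {}

    /** Returns the href value of every <a> element that has one, in document order. */
    static List<String> linkTargets(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                hrefs.add(anchor.getAttribute("href"));
            }
        }
        return hrefs;
    }
}
```

Inside the listener above, you could call DomQueries.linkTargets(doc) right after obtaining doc.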
The Selenium library is normally used for testing, but it gives you remote control of most standard browsers (IE, Firefox, etc.) as well as a headless, browser-free mode (using HtmlUnit). Because it is intended for UI verification by page scraping, it may well serve your purposes.
In my experience it can sometimes struggle with very slow JavaScript, but with careful use of "wait" commands you can get quite reliable results.
It also has the benefit that you can actually drive the page, not just scrape it. That means that if you perform some actions on the page before you get to the data you want (click the search button, click next, now scrape) then you can code that into the process.
I don't know if you'll be able to get the full DOM in a navigable form from Selenium, but it does provide XPath retrieval for the various parts of the page, which is what you'd normally need for a scraping application.
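To illustrate what I mean by "wait" commands: Selenium ships its own wait helpers, and the sketch below is not Selenium's API. It is just a plain-Java illustration of the underlying idea, which is to poll a condition until it holds or a timeout expires. Waits and waitUntil are hypothetical names.

```java
import java.util.function.BooleanSupplier;

/** Plain-Java illustration of a polling wait (hypothetical helper, not Selenium's API). */
final class Waits {

    private Waits() {}

    /**
     * Polls the condition every pollMillis until it returns true or timeoutMillis
     * has elapsed. Returns true if the condition held in time, false otherwise.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

In a scraper, the condition would be something like "the results table is now present on the page", checked before you read the data.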
You can try JExplorer. For more information see http://www.teamdev.com/downloads/jexplorer/docs/JExplorer-PGuide.html
You can also try Cobra, see http://lobobrowser.org/cobra.jsp
I haven't tried this project, but I have seen several implementations for Node.js that include JavaScript DOM manipulation.
https://github.com/tmpvar/jsdom
You can use Java or Groovy, with or without Grails, together with WebDriver, Selenium, Spock, and Geb. These are intended for testing, but the libraries are useful for your case. You can implement a crawler that doesn't open a new browser window but only uses the runtime of one of these browsers.