I need to scrape a website with content 'inserted' by Angular, and it needs to be done with Java.
I have tried Selenium WebDriver (as I have used Selenium before).
In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.
You can find the Maven dependency here. Here is more info on GhostDriver.
The setup in Maven: I have added the following dependencies:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>
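Note that PhantomJSDriver drives an actual PhantomJS process, so the PhantomJS binary itself has to be installed separately. A minimal sketch of pointing the driver at it, assuming an example install location of /usr/local/bin/phantomjs (adjust the path for your machine):

// GhostDriver reads the "phantomjs.binary.path" system property
// to locate the PhantomJS executable
System.setProperty("phantomjs.binary.path", "/usr/local/bin/phantomjs");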
It also runs with Selenium version 2.45, which is the latest version at the time of writing. I mention this because of some articles I read in which people say that the PhantomJS driver isn't compatible with every version of Selenium, but it seems they have addressed that problem in the meantime.
If you are already using a Selenium/PhantomJSDriver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium. That will fix it.
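For reference, this is how the Selenium dependency itself could be pinned in the same pom.xml (a minimal sketch; org.seleniumhq.selenium:selenium-java is the standard Selenium artifact, and 2.45.0 matches the version mentioned above):

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.45.0</version>
</dependency>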
And here is some sample code:
import java.util.List;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // The website I am scraping uses SSL, but I don't know which version,
    // so accept any SSL protocol
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new String[] {
        "--ssl-protocol=any"
    });
    PhantomJSDriver driver = new PhantomJSDriver(options);
    driver.get("https://www.mywebsite");
    // Collect all elements rendered with the target class and print their text
    List<WebElement> elements = driver.findElementsByClassName("media-title");
    for (WebElement element : elements) {
        System.out.println(element.getText());
    }
    driver.quit();
}
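One more note: because Angular inserts the content asynchronously, the elements may not be there yet right after driver.get() returns. A minimal sketch of an explicit wait, assuming the same media-title class as above (WebDriverWait and ExpectedConditions come with the standard selenium-java artifact):

import org.openqa.selenium.By;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Wait up to 10 seconds for Angular to render at least one matching element
new WebDriverWait(driver, 10)
        .until(ExpectedConditions.presenceOfElementLocated(By.className("media-title")));

In practice this would go between driver.get(...) and the findElementsByClassName(...) call in the sample above.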