I need to scrape a website whose content is 'inserted' by Angular, and it needs to be done in Java.
I have tried Selenium WebDriver, as I have used Selenium before.
In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.
You can find the maven dependency here. Here is more info on ghost driver.
For the setup in Maven, I have added the following dependencies:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>
It also runs with Selenium version 2.45, which is the latest version at the time of writing. I mention this because of some articles I read in which people say that the PhantomJS driver isn't compatible with every version of Selenium, but it seems that problem has been addressed in the meantime.
If you are already using a Selenium/PhantomJSDriver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium. That should fix it.
And here is some sample code:
import java.util.List;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // the website I am scraping uses SSL, but I don't know which version
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
            new String[] { "--ssl-protocol=any" });

    PhantomJSDriver driver = new PhantomJSDriver(options);
    driver.get("https://www.mywebsite");

    // the rendered DOM is available once the page has loaded
    List<WebElement> elements = driver.findElementsByClassName("media-title");
    for (WebElement element : elements) {
        System.out.println(element.getText());
    }
    driver.quit();
}
Here is a solution to scrape any web page with JSoup and WebDriver in Java:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get(bean.getDomainQuery().trim());
// parse the fully rendered page source with JSoup
Document doc = Jsoup.parse(driver.getPageSource());
And then use JSoup selectors to read any tag's info.
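To illustrate that last step, here is a minimal sketch of reading elements with jsoup's CSS selectors. The HTML string is a stand-in for what `driver.getPageSource()` would return, and the `media-title` class is borrowed from the earlier answer; substitute your own selector for the tags you care about.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSelectorDemo {

    // Extract the text of every element matching a CSS selector from an HTML string
    static List<String> extract(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssSelector).stream()
                  .map(e -> e.text())
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // static HTML standing in for driver.getPageSource()
        String html = "<div><h2 class=\"media-title\">First</h2>"
                    + "<h2 class=\"media-title\">Second</h2></div>";
        System.out.println(extract(html, ".media-title"));
    }
}
```

`select` accepts the usual CSS syntax (`.class`, `#id`, `tag[attr=value]`), so the same helper works for whatever markup your page uses.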