I need to scrape a website whose content is 'inserted' by Angular, and it needs to be done in Java.
I have tried Selenium WebDriver, as I have used Selenium before.
In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.
You can find the maven dependency here. Here is more info on ghost driver.
For the setup in Maven, I have added the following dependencies:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>
It also runs with Selenium version 2.45, which is the latest version at the time of writing. I mention this because of some articles I read in which people say that the PhantomJS driver isn't compatible with every version of Selenium, but it seems that problem has been addressed in the meantime.
If you are already using a Selenium/PhantomJSDriver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium. That should fix it.
And here is some sample code:
import java.util.List;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // the website I am scraping uses SSL, but I don't know which version
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
            new String[] { "--ssl-protocol=any" });

    PhantomJSDriver driver = new PhantomJSDriver(options);
    driver.get("https://www.mywebsite");

    // the rendered DOM is available once the page has loaded
    List<WebElement> elements = driver.findElementsByClassName("media-title");
    for (WebElement element : elements) {
        System.out.println(element.getText());
    }
    driver.quit();
}
Here is a solution to scrape any web page with JSoup and WebDriver in Java:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get(bean.getDomainQuery().trim());
// parse the fully rendered page source with JSoup
Document doc = Jsoup.parse(driver.getPageSource());
And then use JSoup selectors to read any tag's info.
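To illustrate that last step, here is a minimal sketch of reading elements with jsoup's CSS selectors. The HTML string is a stand-in for what `driver.getPageSource()` would return, and the `media-title` class is borrowed from the earlier answer; substitute your own selector for the tags you care about.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSelectorDemo {

    // Extract the text of every element matching a CSS selector from an HTML string
    static List<String> extract(String html, String cssSelector) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssSelector).stream()
                  .map(e -> e.text())
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // static HTML standing in for driver.getPageSource()
        String html = "<div><h2 class=\"media-title\">First</h2>"
                    + "<h2 class=\"media-title\">Second</h2></div>";
        System.out.println(extract(html, ".media-title"));
    }
}
```

`select` accepts the usual CSS syntax (`.class`, `#id`, `tag[attr=value]`), so the same helper works for whatever markup your page uses.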