scrape an angularjs website with java

Backend · Unresolved · 2 answers · 774 views

梦谈多话 · 2021-01-03 16:47

I need to scrape a website whose content is 'inserted' by Angular, and it needs to be done with Java.

I have tried Selenium WebDriver (as I have used Selenium before).

2 Answers
  • 2021-01-03 17:12

    In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.

    You can find the Maven dependency here. Here is more info on GhostDriver.

    For the setup in Maven, I have added the following dependencies:

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.41.0</version>
    </dependency>
    <dependency>
        <groupId>com.github.detro</groupId>
        <artifactId>phantomjsdriver</artifactId>
        <version>1.2.0</version>
    </dependency>


    It also runs with Selenium version 2.45, which is the latest version as of this writing. I am mentioning this because of some articles I read in which people say that the PhantomJS driver isn't compatible with every version of Selenium, but I guess that problem has been addressed in the meantime.

    If you are already using a Selenium/PhantomJSDriver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium. That will fix it.
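    Pinning Selenium at that version in the pom might look like the sketch below. The `selenium-java` coordinates are the standard ones on Maven Central, but treat the exact version string as an assumption to verify:

    ```xml
    <!-- Assumed coordinates; check Maven Central for the exact 2.45.x release -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>2.45.0</version>
    </dependency>
    ```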

    And here is some sample code (imports added for completeness):

    import java.util.List;

    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriverService;
    import org.openqa.selenium.remote.DesiredCapabilities;

    public void testPhantomDriver() throws Exception {
        DesiredCapabilities options = new DesiredCapabilities();
        // the website I am scraping uses SSL, but I don't know which version
        options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new String[] {
            "--ssl-protocol=any"
        });

        PhantomJSDriver driver = new PhantomJSDriver(options);

        driver.get("https://www.mywebsite");

        // collect every element carrying the class the site uses for titles
        List<WebElement> elements = driver.findElementsByClassName("media-title");

        for (WebElement element : elements) {
            System.out.println(element.getText());
        }

        driver.quit();
    }

    
  • 2021-01-03 17:12

    Here is a solution for scraping any web page with Jsoup and WebDriver in Java:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    ChromeOptions chromeOptions = new ChromeOptions();
    chromeOptions.addArguments("--headless");
    WebDriver driver = new ChromeDriver(chromeOptions);
    // bean.getDomainQuery() is assumed to return the target URL
    driver.get(bean.getDomainQuery().trim());
    // hand the JavaScript-rendered page source to Jsoup
    Document doc = Jsoup.parse(driver.getPageSource());


    And then use Jsoup selectors to read any tag's contents.
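    A minimal sketch of that last step, using a hard-coded HTML string in place of `driver.getPageSource()`. The class name `media-title` is borrowed from the first answer purely for illustration:

    ```java
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class JsoupSelectorDemo {
        public static void main(String[] args) {
            // Stand-in for driver.getPageSource(); a real run would pass
            // the browser-rendered HTML here instead.
            String html = "<div class=\"media-title\">First title</div>"
                        + "<div class=\"media-title\">Second title</div>";
            Document doc = Jsoup.parse(html);
            // CSS-style selector: every <div> carrying the media-title class
            for (Element el : doc.select("div.media-title")) {
                System.out.println(el.text());
            }
        }
    }
    ```

    `select()` takes the same CSS-like syntax throughout, so swapping in an id (`#main`) or attribute selector (`a[href]`) works the same way.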
