Getting web elements using Jsoup

醉梦人生 2021-01-26 01:27

I'm trying to use Jsoup to get stock data from a website called Morningstar. I've looked at other forums and haven't been able to find out what's wrong.

1 Answer
  • 2021-01-26 01:53

    Jsoup only parses the HTML it is given and does not execute scripts. Since the content on this page is generated dynamically by JavaScript, you could use a headless browser like HtmlUnit: https://sourceforge.net/projects/htmlunit/

    The information regarding the price, etc. is embedded in an iframe, so we first grab the iframe link (which is also built dynamically) and then load and parse the iframe itself.

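    // Imports (HtmlUnit 2.x package names); the statements below belong in a method that throws IOException.
    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Silence HtmlUnit's verbose logging.
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

    // Headless browser: JavaScript on, CSS off, and no exceptions on script errors or failing status codes.
    final WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setTimeout(1000); // milliseconds

    // Load the quote page and hand the rendered DOM to Jsoup.
    HtmlPage page = webClient.getPage("http://www.morningstar.com/stocks/xnas/aapl/quote.html");
    Document doc = Jsoup.parse(page.asXml());

    String title = doc.select(".r_title").select("h1").text();

    // The iframe src is protocol-relative ("//host/..."), so prepend the scheme.
    String iFramePath = "http:" + doc.select("#quote_quicktake").select("iframe").attr("src");

    // Load the iframe document, which contains the price.
    page = webClient.getPage(iFramePath);
    doc = Jsoup.parse(page.asXml());

    System.out.println(title + " | Last Price [$]: " + doc.select("#last-price-value").text());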
    

    prints:

    Apple Inc | Last Price [$]: 98.63
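
    The code above prepends "http:" to the iframe's src, which works only while that attribute stays protocol-relative. As a minimal sketch of a more tolerant variant, the src could be resolved against the page URL with java.net.URI from the standard library. The class name FrameUrls and its method are hypothetical, not part of Jsoup or HtmlUnit. Jsoup can do a similar resolution itself through absUrl("src") when the document was parsed with a base URI.

```java
import java.net.URI;

// Hypothetical helper, not part of Jsoup or HtmlUnit.
final class FrameUrls {

    // Resolves an iframe src (protocol-relative, relative, or absolute)
    // against the URL of the page that contains the iframe.
    static String resolve(String pageUrl, String src) {
        if (src == null || src.trim().isEmpty()) {
            throw new IllegalArgumentException("iframe has no src attribute");
        }
        // URI.resolve keeps an absolute src as-is and borrows the scheme
        // (and, for relative paths, the host and directory) from pageUrl.
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }
}
```

    With this helper, the iFramePath line could become FrameUrls.resolve(pageUrl, src), where pageUrl is the quote page address and src is the attribute value selected from #quote_quicktake. An empty src, for example when the iframe has not been rendered yet, then fails with a clear message instead of producing the bare string "http:".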
    

    The JavaScript engine in HtmlUnit is rather slow (the code above takes about 18 seconds on my machine), so it may be worth looking into other JavaScript engines or headless browsers (PhantomJS, etc.; see this list of options: https://github.com/dhamaniasad/HeadlessBrowsers) to improve performance. Still, HtmlUnit gets the job done. You could also try to filter out irrelevant scripts, images, etc. with a custom WebConnectionWrapper:

    http://htmlunit.10904.n7.nabble.com/load-parse-speedup-tp22735p22738.html
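
    To illustrate the filtering idea, here is a minimal sketch of the decision logic only, written against the Java standard library. The class ResourceFilter and both skip lists are hypothetical and would need tuning for the actual page. In HtmlUnit, a check like this would typically be called from an overridden getResponse(WebRequest) in a WebConnectionWrapper subclass, which returns an empty response for skipped URLs and delegates to super otherwise. That wiring is left out here.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: decides which requests are not worth loading.
final class ResourceFilter {

    // Assumed skip lists; adjust them to the site being scraped.
    private static final List<String> SKIPPED_EXTENSIONS =
            Arrays.asList(".png", ".jpg", ".jpeg", ".gif", ".css", ".woff");
    private static final List<String> SKIPPED_HOST_FRAGMENTS =
            Arrays.asList("analytics", "doubleclick");

    static boolean shouldSkip(String url) {
        final URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: do not interfere
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);

        // Skip static assets, matched on the path so query strings are ignored.
        for (String extension : SKIPPED_EXTENSIONS) {
            if (path.endsWith(extension)) {
                return true;
            }
        }
        // Skip hosts that look like trackers or ad servers.
        for (String fragment : SKIPPED_HOST_FRAGMENTS) {
            if (host.contains(fragment)) {
                return true;
            }
        }
        return false;
    }
}
```

    Be careful not to skip scripts that the page needs to build the iframe or the price element. Starting with images, fonts, and obvious third-party hosts is the safer choice.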
