Jsoup fetching a partial page

天涯浪人 2021-01-02 08:52

I am trying to scrape the contents of bidding websites, but I am unable to fetch the complete page. I am using crowbar on xulrunner to fetch the page first (as

1 answer
  •  隐瞒了意图╮
    2021-01-02 09:34

    Try using HtmlUnit to render the page (including JavaScript and CSS dom manipulation) and then pass the rendered HTML to jsoup.

    // load the page with HtmlUnit so that its scripts run
    // (myURL is a placeholder for the address of the page to fetch)
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage(myURL);

    // serialize the rendered DOM and parse it with jsoup
    // (baseURI is a placeholder used to resolve relative links)
    Document doc = Jsoup.parse(myPage.asXml(), baseURI);

    // clean up resources
    webClient.close();
    


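    page.html - source code

        <html>
        <head>
            <script type="text/javascript" src="loadData.js"></script>
        </head>
        <body onload="loadData()">
            <table id="data">
                <tr>
                    <th>col1</th>
                    <th>col2</th>
                </tr>
            </table>
        </body>
        </html>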

    loadData.js

        // append rows and cols to table.data in page.html
        function loadData() {
            var data = document.getElementById("data");
            for (var row = 0; row < 2; row++) {
                var tr = document.createElement("tr");
                for (var col = 0; col < 2; col++) {
                    var td = document.createElement("td");
                    td.appendChild(document.createTextNode(row + "." + col));
                    tr.appendChild(td);
                }
                data.appendChild(tr);
            }
        }
    

    page.html when loaded to browser

    | Col1   | Col2   |
    | ------ | ------ |
    | 0.0    | 0.1    |
    | 1.0    | 1.1    |
    

    Using jsoup to parse page.html for col data

        // load source from file
        Document doc = Jsoup.parse(new File("page.html"), "UTF-8");

        // iterate over each row, then each cell in that row
        for (Element row : doc.select("table#data > tbody > tr")) {
            for (Element col : row.select("td")) {
                // print results
                System.out.println(col.ownText());
            }
        }
    

    Output

    (empty)

    What happened?

    Jsoup parses the source code as delivered by the server (or, in this case, as loaded from the file). It does not execute client-side code such as JavaScript, so no script-driven DOM manipulation takes place. In this example, loadData() never runs, and the rows and cols are never appended to the data table.
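
    A quick way to confirm that this is the cause is to check whether the elements you want exist in the raw source at all. The sketch below is only a rough diagnostic using the standard library: the class name RawSourceCheck, the helper countOpeningTags, and the two HTML strings are assumptions made up for illustration (stand-ins for the page before and after the script runs), and the regex is not a substitute for a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawSourceCheck {

    // Counts opening tags with the given name in an HTML string.
    // Rough regex-based check for diagnostics only, not a real HTML parser.
    public static int countOpeningTags(String html, String tagName) {
        Pattern pattern = Pattern.compile(
                "<" + Pattern.quote(tagName) + "(\\s[^>]*)?>",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the raw source: header row only.
        String rawSource =
                "<table id=\"data\"><tr><th>col1</th><th>col2</th></tr></table>";

        // Hypothetical stand-in for the markup after the script has run.
        String rendered =
                "<table id=\"data\"><tr><th>col1</th><th>col2</th></tr>"
                + "<tr><td>0.0</td><td>0.1</td></tr>"
                + "<tr><td>1.0</td><td>1.1</td></tr></table>";

        System.out.println("td in raw source: " + countOpeningTags(rawSource, "td"));
        System.out.println("td in rendered markup: " + countOpeningTags(rendered, "td"));
    }
}
```

    If the count for the raw source is zero while the browser shows the cells, the content is being generated on the client, and a static parser will not see it.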

    How to parse my page as rendered in the browser?

        // load page using HtmlUnit and fire scripts
        WebClient webClient = new WebClient();
        HtmlPage myPage = webClient.getPage(new File("page.html").toURI().toURL());

        // convert page to generated HTML and convert to document
        doc = Jsoup.parse(myPage.asXml());

        // iterate over each row, then each cell in that row
        for (Element row : doc.select("table#data > tbody > tr")) {
            for (Element col : row.select("td")) {
                // print results
                System.out.println(col.ownText());
            }
        }

        // clean up resources
        webClient.close();
    

    Output

    0.0
    0.1
    1.0
    1.1
    
