Jsoup - hidden div class?

挽巷 2021-01-16 20:32

I'm trying to scrape a div class, but everything I have tried has failed so far :(

I'm trying to scrape the element(s):



        
2 answers
  •  不思量自难忘°
    2021-01-16 21:03

    What you see in your web browser is not what Jsoup sees. Disable JavaScript and refresh the page to see what Jsoup gets, or press CTRL+U ("Show source", not "Inspect"!) in your browser to see the original HTML document before JavaScript modifies it. Your browser's debugger shows the final document after those modifications, so it's not suitable for your needs.

    It seems like the whole "UPCOMING EVENTS" section is dynamically loaded by JavaScript. Even more, this section is loaded asynchronously with AJAX. You can use your browser's debugger (Network tab) to see every request and response.

    I found the request that loads this section, but unfortunately all the data you need is returned as JSON, so you're going to need another library to parse JSON.

    That's not the end of the bad news, because this case is more complicated. You could make a direct request for the data: http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b but the URL seems random, and a few of these URLs (one per upcoming event?) are included inside the JavaScript code in the HTML.

    My approach would be to get the URLs of these feeds with something like:

    
            List<String> feedUrls = new ArrayList<>();
    
            // select all the scripts
            Elements scripts = document.select("script");
            for (Element script : scripts) {
                // use data(), not text(): Jsoup stores script contents as data nodes, so text() is empty
                String scriptSource = script.data();
                if (scriptSource.contains("http://www.bellator.com/feeds/")) {
                    // here use regexp to get all URLs from scriptSource and add them to feedUrls
    
                }
            }
    
            for (String feedUrl : feedUrls) {
                // iterate over feed URLs, download each of them;
                // execute().body() returns the raw response text instead of parsing the JSON as HTML
                String json = Jsoup.connect(feedUrl).ignoreContentType(true).execute().body();
                // here use JSON parsing library to get the data you need
    
            }
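
    The regexp step could look something like the sketch below. It uses only java.util.regex, so it works on any string you pass in (for example the contents of one script element). The class name FeedUrlExtractor, the method name extractFeedUrls and the exact pattern are my own assumptions, not something taken from the page: the pattern simply takes the feed prefix mentioned above and reads on until it hits a quote, whitespace, an angle bracket or a backslash. If the site escapes slashes inside its scripts, the pattern would need adjusting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class FeedUrlExtractor {

    // Assumed URL shape: the feed prefix from the answer, followed by any run of
    // characters that is not a quote, whitespace, angle bracket, or backslash.
    private static final Pattern FEED_URL =
            Pattern.compile("http://www\\.bellator\\.com/feeds/[^\"'\\s<>\\\\]+");

    // Returns every distinct feed URL found in the given script source,
    // in the order of first appearance.
    static List<String> extractFeedUrls(String scriptSource) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = FEED_URL.matcher(scriptSource);
        while (matcher.find()) {
            String url = matcher.group();
            if (!urls.contains(url)) { // skip duplicates, keep first-seen order
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up script body, for illustration only; the real scripts will look different.
        String script = "var feeds = ['http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/aaa', "
                + "\"http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/bbb\"];";
        for (String url : extractFeedUrls(script)) {
            System.out.println(url);
        }
    }
}
```

    Inside the loop over scripts you would then call something like feedUrls.addAll(FeedUrlExtractor.extractFeedUrls(...)) with the script contents.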
    

    An alternative approach would be to stop using Jsoup because of its limitations and use Selenium WebDriver instead. It supports dynamic page modifications by JavaScript, so you'd get the HTML of the final result - exactly what you see in the web browser and Inspector.
