Rvest not recognizing css selector

醉话见心 2020-12-20 03:52

I'm trying to scrape this website:

http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true

through

2 Answers
  • 2020-12-20 04:35

    It's making an XHR request to generate the HTML. Try this (which should also make it easier to automate the data capture):

    library(httr)
    library(xml2)
    library(rvest)
    
    res <- GET("http://www.racingpost.com/greyhounds/result_by_meeting_full.sd",
               query=list(r_date="2015-12-26",
                          meeting_id=18))
    
    doc <- read_html(content(res, as="text"))
    
    html_nodes(doc, ".black")
    ## {xml_nodeset (56)}
    ##  [1] <span class="black">A9</span>
    ##  [2] <span class="black">£61</span>
    ##  [3] <span class="black">470m</span>
    ##  [4] <span class="black">-30</span>
    ##  [5] <span class="black">H2</span>
    ##  [6] <span class="black">£105</span>
    ##  [7] <span class="black">470m</span>
    ##  [8] <span class="black">-30</span>
    ##  [9] <span class="black">A7</span>
    ## [10] <span class="black">£61</span>
    ## [11] <span class="black">470m</span>
    ## [12] <span class="black">-30</span>
    ## [13] <span class="black">A5</span>
    ## [14] <span class="black">£66</span>
    ## [15] <span class="black">470m</span>
    ## [16] <span class="black">-30</span>
    ## [17] <span class="black">A8</span>
    ## [18] <span class="black">£61</span>
    ## [19] <span class="black">470m</span>
    ## [20] <span class="black">-20</span>
    ## ...
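
    To automate this across meetings, the query values can be read off the page URL itself. A minimal sketch, assuming the fragment keys resultDay and meetingId map onto r_date and meeting_id as in the request above; parse_fragment is a hypothetical helper, not part of httr or rvest. It only builds the request URL and sends nothing:

    ```r
    library(httr)

    # Hypothetical helper: split the "#..." fragment of a results-page URL
    # into a named list of its key=value pairs.
    parse_fragment <- function(page_url) {
      fragment <- sub("^[^#]*#", "", page_url)
      pairs <- strsplit(strsplit(fragment, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
      values <- vapply(pairs, function(p) p[2], character(1))
      names(values) <- vapply(pairs, function(p) p[1], character(1))
      as.list(values)
    }

    page_url <- "http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true"
    params <- parse_fragment(page_url)

    # Assumed mapping from fragment keys to the XHR query parameters.
    query <- list(r_date = params$resultDay, meeting_id = params$meetingId)

    # Build the request URL without sending it; pass it to GET() when ready.
    xhr_url <- modify_url(
      "http://www.racingpost.com/greyhounds/result_by_meeting_full.sd",
      query = query
    )
    ```

    Looping over a vector of page URLs and calling GET(xhr_url) for each would then collect several meetings in one run.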
    
  • 2020-12-20 04:44

    Your selector is fine and rvest is working as expected. The problem is that what you are looking for is not in the url object.

    If you open that website and use the browser's inspection tool, you will see that all the data you want is a descendant of <div id="resultMainOutput">. If you then look at the page's source code, you will see this (line breaks added for readability):

    <div id="resultMainOutput">
        <div class="wait">
           <img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
        </div>
    </div>
    

    The data you want is loaded dynamically, and rvest cannot cope with that. It can only fetch the page source and retrieve what is already there, without any client-side processing.
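
    This can be checked offline by parsing the placeholder markup quoted above. A small sketch, assuming only that snippet as input (static_source is an illustrative variable name): the container is found, but the .black selector matches nothing because the results have not been injected yet.

    ```r
    library(xml2)
    library(rvest)

    # The static source as quoted above: only the loading placeholder is present.
    static_source <- '<div id="resultMainOutput">
      <div class="wait">
        <img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
      </div>
    </div>'

    doc <- read_html(static_source)

    # The container itself is present in the static source ...
    container <- html_nodes(doc, "#resultMainOutput")
    # ... but no result nodes exist yet, so this node set is empty.
    results <- html_nodes(doc, ".black")

    length(container)
    length(results)
    ```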

    The exact same issue was brought up in the blog post introducing rvest, and here is what the package author had to say:

    You have two options for pages like that:

    1. Use the debug console in the web browser to reverse engineer the communications protocol and request the raw data directly from the server.

    2. Use a package like RSelenium to automate a web browser.

    If you don't need to obtain that data repeatedly, or you can accept a bit of manual work in every analysis, the easiest workaround is:

    1. Open the website in the web browser of your choice
    2. Using the browser's inspection tool, copy the current page content (the entire page or only the <div id="resultMainOutput"> content)
    3. Paste it into a text editor and save it as a new file
    4. Run the analysis on that file

    library(rvest)

    # Parse the locally saved copy of the rendered page
    url <- read_html("/tmp/racingpost.html")
    html_nodes(url, ".black")
    # {xml_nodeset (56)}
    # [1] <span class="black">A9</span>
    # [2] <span class="black">£61</span>
    # [3] <span class="black">470m</span>
    # [4] <span class="black">-30</span>
    # (skip the rest)
    

    As you can see, there are some encoding issues along the way, but they can be solved later on.
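
    One possible way to handle the encoding, sketched under the assumption that the saved file is UTF-8: declare the encoding explicitly when reading it. The temporary file below is a hypothetical stand-in for the saved page so that the snippet runs on its own.

    ```r
    library(xml2)
    library(rvest)

    # Hypothetical stand-in for the saved page: a tiny UTF-8 file with one node.
    path <- tempfile(fileext = ".html")
    html <- "<html><body><span class=\"black\">\u00a361</span></body></html>"
    writeBin(charToRaw(enc2utf8(html)), path)

    # Declare the encoding explicitly instead of letting the parser guess.
    doc <- read_html(path, encoding = "UTF-8")
    prize <- html_text(html_nodes(doc, ".black"))

    # Strip the pound sign to get a numeric amount.
    amount <- as.numeric(sub("\u00a3", "", prize, fixed = TRUE))
    ```

    If the file was saved in a different encoding, the encoding argument would need to name that one instead.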
