I'm trying to scrape this website:
http://www.racingpost.com/greyhounds/result_home.sd#resultDay=2015-12-26&meetingId=18&isFullMeeting=true
using rvest, but my CSS selector comes back empty.
The page builds that HTML through an XHR request. Try requesting that endpoint directly instead (which should also make it easier to automate the data capture):
library(httr)
library(xml2)
library(rvest)
res <- GET("http://www.racingpost.com/greyhounds/result_by_meeting_full.sd",
           query = list(r_date = "2015-12-26",
                        meeting_id = 18))
doc <- read_html(content(res, as="text"))
html_nodes(doc, ".black")
## {xml_nodeset (56)}
## [1] <span class="black">A9</span>
## [2] <span class="black">£61</span>
## [3] <span class="black">470m</span>
## [4] <span class="black">-30</span>
## [5] <span class="black">H2</span>
## [6] <span class="black">£105</span>
## [7] <span class="black">470m</span>
## [8] <span class="black">-30</span>
## [9] <span class="black">A7</span>
## [10] <span class="black">£61</span>
## [11] <span class="black">470m</span>
## [12] <span class="black">-30</span>
## [13] <span class="black">A5</span>
## [14] <span class="black">£66</span>
## [15] <span class="black">470m</span>
## [16] <span class="black">-30</span>
## [17] <span class="black">A8</span>
## [18] <span class="black">£61</span>
## [19] <span class="black">470m</span>
## [20] <span class="black">-20</span>
## ...
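If the goal is a tidy table rather than raw nodes, note that the values above repeat in groups of four per race, so they can be reshaped with a few lines of base R. Here is a sketch built on a small inline sample of that output (the column names are my guesses at what the fields mean, not anything the site guarantees):

```r
library(xml2)
library(rvest)

# A small inline sample of the spans shown above, so this runs offline
snippet <- '<div>
  <span class="black">A9</span><span class="black">£61</span>
  <span class="black">470m</span><span class="black">-30</span>
  <span class="black">H2</span><span class="black">£105</span>
  <span class="black">470m</span><span class="black">-30</span>
</div>'

vals <- html_text(html_nodes(read_html(snippet), ".black"))

# The values repeat in groups of four; these names are assumptions
races <- as.data.frame(matrix(vals, ncol = 4, byrow = TRUE),
                       stringsAsFactors = FALSE)
names(races) <- c("grade", "prize", "distance", "going")
races
```

The same reshaping should work on the full 56-node set returned by the live request, as long as the four-column pattern holds for every race.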
Your selector is good and rvest is working just fine. The problem is that what you are looking for is not in the `url` object.
If you open that website and use your browser's inspector, you will see that all the data you want is a descendant of <div id="resultMainOutput">. But if you look at the page's source code, you will see this (line breaks added for readability):
<div id="resultMainOutput">
<div class="wait">
<img src="http://ui.racingpost.com/img/all/loading.gif" alt="Loading..." />
</div>
</div>
The data you want is loaded dynamically, and rvest cannot cope with that: it only fetches the page's source code and retrieves whatever is there, without any client-side processing. The exact same issue was brought up under the blog post introducing rvest, and here is what the package author had to say:
You have two options for pages like that:
Use the debug console in the web browser to reverse engineer the communications protocol and request the raw data directly from the server.
Use a package like RSelenium to automate a web browser.
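For the second option, a minimal RSelenium sketch could look roughly like this. It assumes a Selenium-capable browser is available locally (rsDriver() tries to start one), and it cannot run without that, so treat it as an outline rather than a ready-made solution:

```r
library(RSelenium)
library(rvest)
library(xml2)

# rsDriver() starts a Selenium server plus a browser; requires a local driver
rD <- rsDriver(browser = "firefox")
remDr <- rD$client

remDr$navigate(paste0("http://www.racingpost.com/greyhounds/result_home.sd",
                      "#resultDay=2015-12-26&meetingId=18&isFullMeeting=true"))
Sys.sleep(5)  # crude wait for the XHR-driven content to render

# getPageSource() returns the DOM *after* client-side processing
doc <- read_html(remDr$getPageSource()[[1]])
html_nodes(doc, "#resultMainOutput .black")

remDr$close()
rD$server$stop()
```

The fixed Sys.sleep() is the weakest part; a production script would poll for the target element instead.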
If you don't need to obtain that data repeatedly, or you can accept a bit of manual work in each analysis, the easiest workaround is to copy the rendered contents of <div id="resultMainOutput"> from your browser's inspector into a local file and parse that file instead:
> url <- read_html("/tmp/racingpost.html")
> html_nodes(url, ".black")
# {xml_nodeset (56)}
# [1] <span class="black">A9</span>
# [2] <span class="black">£61</span>
# [3] <span class="black">470m</span>
# [4] <span class="black">-30</span>
# (skip the rest)
As you can see, there are some encoding issues along the way, but those can be sorted out afterwards.
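On the encoding point: read_html() accepts an encoding argument, and declaring it explicitly usually repairs mangled £ signs. A self-contained sketch, where a temp file stands in for the manually saved /tmp/racingpost.html:

```r
library(xml2)

# Stand-in for the manually saved page, written here so the example runs
tmp <- tempfile(fileext = ".html")
writeLines('<span class="black">£61</span>', tmp, useBytes = TRUE)

# Declaring the encoding up front avoids mangled currency symbols
doc <- read_html(tmp, encoding = "UTF-8")
xml_text(xml_find_first(doc, "//span"))
```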