Question
I am trying to scrape the daily forecast from FiveThirtyEight using rvest, but my object of interest seems to be a JavaScript object, and I am having difficulty even working out where and what to look for. (I'm not well versed in CSS or JavaScript, though I have tried to educate myself over the last couple of days.)
By inspecting the webpage element and CSS selector, I have figured out the following:
- The location to look is <div id="polling-avg-chart">, so I tried
library(rvest)
url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"
url %>%
  read_html() %>%
  html_nodes("#polling-avg-chart")
without much success. The output is simply
{xml_nodeset (1)}
[1] <div id="polling-avg-chart"></div>\n
- The individual poll results, shown as dots, are in <g style="clip-path: url("#line-clippoll_avg");"> ... </g>, where you see 502 locations given as numbers. I'm guessing that I will have to translate the cx and cy of each node into the appropriate percentages, which is done by <g class="flag-box" transform="translate(30, 161.44093322753096)">...</g> and so on. However, I do not see the underlying data for the forecast line, as opposed to the dots.
- When I let my cursor hover over the chart, I see things such as <line class="hover-date-line hide-line"> change, and values such as <path class="link" d="M 0 171.40106812500002 C 15 171.40106812500002 15 170.94093803735575 30 170.94093803735575"></path> change, and I'm guessing that these values are what's creating the daily forecast line.
- But where these values are stored, and how to translate them back to things like "49.1% Clinton vs. 26.6% Sanders", is still a mystery to me.
I did read a few other SO posts such as this but none of them seemed applicable to this particular problem. What would be the best way to get the forecast percentages in a neat dataframe?
Answer 1:
Another way is to grab the resource directly.
In your browser, open Developer Tools (F12 in Chrome/Chromium), head to "Network", refresh (F5), and look for what looks like a nicely formatted JSON. When we've found it, we copy the link address (right-click on the resource > Copy link address).
library(httr)
library(tidyr)
library(purrr)
library(dplyr)
library(ggplot2)
url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/USA.json"
r <- GET(url)
The whole dataset is there, including the weights, so you can probably recalculate those averages. The data as plotted is in "model":
dat <-
  jsonlite::fromJSON(content(r, as = "text")) %>%
  map(purrr::pluck, "model") %>%
  bind_rows(.id = "party") %>%
  mutate_all(readr::parse_guess)
# # A tibble: 5,288 x 5
# party candidate_name state forecastdate poll_avg
# <chr> <chr> <chr> <date> <dbl>
# 1 D Sanders USA 2016-07-01 36.5
# 2 D Clinton USA 2016-07-01 55.4
# 3 D Sanders USA 2016-06-30 37.0
# 4 D Clinton USA 2016-06-30 54.6
# 5 D Sanders USA 2016-06-29 37.0
# 6 D Clinton USA 2016-06-29 54.9
# 7 D Sanders USA 2016-06-28 37.2
# 8 D Clinton USA 2016-06-28 54.4
# 9 D Sanders USA 2016-06-27 37.4
# 10 D Clinton USA 2016-06-27 53.9
# # ... with 5,278 more rows
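To make the shape of that pipeline easier to see, here is a minimal offline sketch on a toy JSON string. The layout (one entry per party, each holding a "model" array, with values stored as strings) is an assumption inferred from the code and output above, the Trump row is invented purely for illustration, and the real file likely carries more fields. It uses base R in place of purrr and dplyr so it runs with only jsonlite installed.

```r
library(jsonlite)

# Toy JSON mimicking the ASSUMED layout: one entry per party, each with a
# "model" array. The R-party values are made up for illustration.
toy_json <- '{
  "D": {"model": [
    {"candidate_name": "Clinton", "forecastdate": "2016-07-01", "poll_avg": "55.4"},
    {"candidate_name": "Sanders", "forecastdate": "2016-07-01", "poll_avg": "36.5"}
  ]},
  "R": {"model": [
    {"candidate_name": "Trump", "forecastdate": "2016-07-01", "poll_avg": "60.0"}
  ]}
}'

parsed <- fromJSON(toy_json)

# Pull the "model" data frame out of each party (what map(pluck, "model") does)
models <- lapply(parsed, function(p) p$model)

# Stack the data frames and record the party (what bind_rows(.id = "party") does)
toy_dat <- do.call(
  rbind,
  Map(function(df, party) cbind(party = party, df), models, names(models))
)
rownames(toy_dat) <- NULL

# Convert the string columns to proper types
toy_dat$forecastdate <- as.Date(toy_dat$forecastdate)
toy_dat$poll_avg <- as.numeric(toy_dat$poll_avg)

print(toy_dat)
```

Running str(parsed, max.level = 2) on the real response is a quick way to check whether the actual file matches this assumed layout before relying on it.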
Reproduce graphs:
dat %>%
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
  ggplot(aes(forecastdate, poll_avg)) +
  geom_line(aes(col = candidate_name)) +
  facet_wrap(~party)
If you'd like interactivity:
library(dygraphs)
library(htmltools)
library(xts)  # provides xts(), used below
foo <- dat %>%
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
  split(.$party) %>%
  map(~ {
    # one wide time series per party: dates as index, one column per candidate
    select(.x, forecastdate, candidate_name, poll_avg) %>%
      spread(candidate_name, poll_avg) %>%
      {xts(.[-1], .[[1]])} %>%
      dygraph(group = "poll-model") %>%
      dyRangeSelector()
  })
browsable(tagList(foo))
Answer 2:
The chart there is almost certainly built with d3.js or a wrapper on top of it. d3 is great for building SVG-based data visualizations because it helps you build scales to map values (such as 40%) to placements on the screen (such as what you see, something like cx=100). The problem is you would need to know what those scales are in order to get the underlying data back, and the scales are likely dynamic and changing based on screen size, etc.
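To illustrate why the scales matter, here is a small sketch of a linear scale and its inverse in base R. The domain (0 to 100 percent) and pixel range (300 down to 0) are hypothetical numbers chosen for the example, not values taken from the FiveThirtyEight chart, and make_linear_scale is an illustrative helper rather than a d3 or R function.

```r
# Build a linear scale and its inverse, in the spirit of a d3 linear scale.
# `domain` is the data range, `range` the pixel range (both hypothetical here).
make_linear_scale <- function(domain, range) {
  slope <- (range[2] - range[1]) / (domain[2] - domain[1])
  list(
    forward = function(value) range[1] + (value - domain[1]) * slope,
    invert  = function(pixel) domain[1] + (pixel - range[1]) / slope
  )
}

# Assumed y scale: 0% sits at the bottom (300 px), 100% at the top (0 px),
# because SVG y coordinates grow downward.
y_scale <- make_linear_scale(domain = c(0, 100), range = c(300, 0))

cy  <- y_scale$forward(40)   # pixel position for 40%
pct <- y_scale$invert(cy)    # recover the percentage from the pixel position
```

Inverting cy only works if both the domain and the pixel range are known, which is exactly the information that is hard to recover from the rendered SVG.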
Instead, since the data is in a table below the chart, you can easily scrape that. The table is inside a div element with the ID latest-polls, and has the class t-polls.
I'm using html_node with the CSS selectors and html_table to convert the table to a data frame, then cleaning up the names and turning the numeric columns into actual numeric columns. There's more you could do next, like format the dates, but hopefully this gets you started.
library(tidyverse)
library(rvest)
url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"
polls_df <- url %>%
  read_html() %>%
  html_node("#latest-polls table.t-polls") %>%
  html_table() %>%
  setNames(c("new", "date", "pollster", "sample_n", "sample_type", names(.)[6:10]) %>% str_remove_all("\\W")) %>%
  mutate_at(vars(sample_n, Clinton, Sanders, OMalley),
            function(x) str_remove_all(x, "\\D") %>% as.numeric())
head(polls_df)
#> new date pollster sample_n sample_type
#> 1 • Jun. 10-13 Selzer & Co. 486 LV
#> 2 • Jun. 26-28 Fox News 432 RV
#> 3 • Jun. 18-20 YouGov 390 LV
#> 4 • Jun. 15-20 Morning Consult 1733 RV
#> 5 • Jun. 27-Jul. 1 Ipsos, online 142 LV
#> 6 • Jun. 16-19 Opinion Research Corporation 435 RV
#> weight leader Clinton Sanders OMalley
#> 1 1.05 Clinton +2 45 43 NA
#> 2 0.91 Clinton +21 58 37 NA
#> 3 0.79 Clinton +13 55 42 NA
#> 4 0.79 Clinton +18 53 35 NA
#> 5 0.67 Clinton +41 70 29 NA
#> 6 0.66 Clinton +12 55 43 NA
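As one possible next step for the dates, here is a rough base R sketch that extracts the first day of each poll's date range. The helper name poll_start_date is hypothetical, and the year argument is an assumption: the table shown above carries no year, so polls from a different year would need it set explicitly.

```r
# Parse the first day of a poll's date range, e.g. "Jun. 27-Jul. 1".
# ASSUMPTION: every poll falls in `year`; the scraped table has no year column.
poll_start_date <- function(x, year = 2016) {
  # leading month abbreviation, e.g. "Jun" from "Jun. 10-13"
  month_txt <- sub("^([A-Za-z]+).*$", "\\1", x)
  # first run of digits after the month, e.g. "10" from "Jun. 10-13"
  day_txt <- sub("^[A-Za-z]+\\.? *([0-9]+).*$", "\\1", x)
  # look the month up in R's built-in English abbreviations (locale-independent)
  month_num <- match(substr(month_txt, 1, 3), month.abb)
  as.Date(sprintf("%d-%02d-%02d", as.integer(year), month_num, as.integer(day_txt)))
}

start_dates <- poll_start_date(c("Jun. 10-13", "Jun. 27-Jul. 1"))
```

Matching against month.abb rather than using as.Date with "%b" keeps the result independent of the system locale.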
Source: https://stackoverflow.com/questions/50381758/rvest-web-scraping-with-javascript