rvest web scraping with javascript

Question

I am trying to scrape the daily forecast from FiveThirtyEight using rvest, but the object I'm interested in seems to be a JavaScript object, and I'm having difficulty even working out where to look and what to look for. (I'm not well versed in CSS or JavaScript, though I have tried to educate myself over the last couple of days.)

By inspecting the webpage element and CSS selector, I have figured out the following:

The place to look seems to be <div id="polling-avg-chart">, so I tried

library(rvest)
url <- 
  "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"

url %>% 
  read_html() %>% 
  html_nodes("#polling-avg-chart")

without much success. The output is simply

{xml_nodeset (1)}

[1] <div id="polling-avg-chart"></div>\n

The individual poll results, drawn as dots, are in <g style="clip-path: url("#line-clippoll_avg");"> ... </g>, which contains 502 points with numeric coordinates. I'm guessing that I will have to translate the cx and cy of each node into the appropriate percentages, which seems to be handled by <g class="flag-box" transform="translate(30, 161.44093322753096)">...</g> and so on.
However, I can only find the data behind the dots, not the underlying data for the forecast line.
When I hover my cursor over the chart, I see elements such as <line class="hover-date-line hide-line"> change, along with values such as <path class="link" d="M 0 171.40106812500002 C 15 171.40106812500002 15 170.94093803735575 30 170.94093803735575"></path>, and I'm guessing that these values are what draw the daily forecast line.
But where these values are stored, and how to translate them back into something like "49.1% Clinton vs. 26.6% Sanders", is still a mystery to me.

I did read a few other SO posts such as this, but none of them seemed applicable to this particular problem. What would be the best way to get the forecast percentages into a neat data frame?

Answer 1:

Another way is to grab the resource directly.

In your browser, open Developer Tools (F12 in Chrome/Chromium), head to the "Network" tab, refresh the page (F5), and look for a request whose response is nicely formatted JSON. Once you've found it, copy the link address (right-click on the resource > Copy link address).

library(httr)
library(tidyr)
library(purrr)
library(dplyr)
library(ggplot2)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/USA.json"

r <- GET(url)

The full dataset is there, including the poll weights, so you can probably recalculate the averages yourself. The data as plotted is in "model":

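dat <- 
  # parse the JSON text into a list with one element per party
  jsonlite::fromJSON(content(r, as = "text")) %>% 
  # keep only the "model" data frame from each element
  map(purrr::pluck, "model") %>% 
  # stack them, storing the list names in a "party" column
  bind_rows(.id = "party") %>% 
  # guess column types (dates, numbers) from the character values
  mutate_all(readr::parse_guess)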

# # A tibble: 5,288 x 5
#    party candidate_name state forecastdate poll_avg
#    <chr> <chr>          <chr> <date>          <dbl>
#  1 D     Sanders        USA   2016-07-01       36.5
#  2 D     Clinton        USA   2016-07-01       55.4
#  3 D     Sanders        USA   2016-06-30       37.0
#  4 D     Clinton        USA   2016-06-30       54.6
#  5 D     Sanders        USA   2016-06-29       37.0
#  6 D     Clinton        USA   2016-06-29       54.9
#  7 D     Sanders        USA   2016-06-28       37.2
#  8 D     Clinton        USA   2016-06-28       54.4
#  9 D     Sanders        USA   2016-06-27       37.4
# 10 D     Clinton        USA   2016-06-27       53.9
# # ... with 5,278 more rows
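
If it helps to see why plucking "model" works, here is a small offline sketch of the same reshaping on a toy JSON string. The structure (one top-level key per party, each holding a "model" array) is an assumption inferred from the code above, and the R row is a placeholder, not real data.

```r
library(jsonlite)
library(purrr)
library(dplyr)

# Toy input. ASSUMPTION: this nesting mirrors the real file only as far as
# the code above implies; the "R" entry is a placeholder, not real data.
toy_json <- '{
  "D": {"model": [
    {"candidate_name": "Sanders", "forecastdate": "2016-07-01", "poll_avg": 36.5},
    {"candidate_name": "Clinton", "forecastdate": "2016-07-01", "poll_avg": 55.4}
  ]},
  "R": {"model": [
    {"candidate_name": "Trump", "forecastdate": "2016-07-01", "poll_avg": 0}
  ]}
}'

toy <- fromJSON(toy_json) %>%          # named list: one element per party
  map(pluck, "model") %>%              # each "model" array becomes a data frame
  bind_rows(.id = "party") %>%         # list names end up in the "party" column
  mutate(forecastdate = as.Date(forecastdate))

print(toy)
```

Here the date column is converted explicitly instead of relying on readr::parse_guess, which may be preferable if you want the column types to be predictable.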

To reproduce the graphs:

dat %>% 
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>% 
  ggplot(aes(forecastdate, poll_avg)) +
  geom_line(aes(col = candidate_name)) +
  facet_wrap(~party)

If you'd like interactivity:

library(dygraphs)
library(htmltools)
library(xts)  # provides xts(), used below to build the time series

foo <- dat %>% 
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>% 
  split(.$party) %>% 
  map(~ {
    select(.x, forecastdate, candidate_name, poll_avg) %>% 
      spread(candidate_name, poll_avg) %>% 
      {xts(.[-1], .[[1]])} %>%
      dygraph(group = "poll-model") %>% 
      dyRangeSelector()
  })

browsable(tagList(foo))

Answer 2:

The chart there is almost certainly built with d3.js or a wrapper on top of it. d3 is great for building SVG-based data visualizations because it helps you build scales that map values (such as 40%) to positions on the screen (what you see in the markup, something like cx=100). The problem is that you would need to know what those scales are in order to get the underlying data back, and the scales are likely dynamic, changing with screen size and so on.
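
To make the scale problem concrete, here is a minimal sketch of inverting a linear scale by hand. The function name, the pixel range and the domain are all hypothetical; they are not taken from the FiveThirtyEight page, and the real chart may not even use a linear scale.

```r
# Invert a linear scale: map a pixel position back to a data value.
# pixel_range and domain are each c(start, end), paired by position.
invert_linear_scale <- function(pixel, pixel_range, domain) {
  frac <- (pixel - pixel_range[1]) / (pixel_range[2] - pixel_range[1])
  domain[1] + frac * (domain[2] - domain[1])
}

# HYPOTHETICAL axis: 0% is drawn at y = 300 px and 100% at y = 0 px
# (SVG y coordinates grow downwards, hence the reversed pixel range).
recovered <- invert_linear_scale(c(300, 150, 0),
                                 pixel_range = c(300, 0),
                                 domain = c(0, 100))
print(recovered)
```

Without the true pixel range and domain, which live in the page's JavaScript, the recovered numbers would only be guesses.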

Instead, since the data is in a table further down the page, you can easily scrape that. The table is inside a div element with the ID latest-polls, and has the class t-polls.

I'm using html_node with CSS selectors to find the table and html_table to convert it to a data frame, then cleaning up the names and turning the numeric columns into actual numeric columns. There's more you could do next, like formatting the dates, but hopefully this gets you started.

library(tidyverse)
library(rvest)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"

polls_df <- url %>% 
  read_html() %>%
  html_node("#latest-polls table.t-polls") %>%
  html_table() %>%
  setNames(c("new", "date", "pollster", "sample_n", "sample_type", names(.)[6:10]) %>% str_remove_all("\\W")) %>%
  mutate_at(vars(sample_n, Clinton, Sanders, OMalley), 
      function(x) str_remove_all(x, "\\D") %>% as.numeric())

head(polls_df)
#>   new           date                     pollster sample_n sample_type
#> 1   •     Jun. 10-13                 Selzer & Co.      486          LV
#> 2   •     Jun. 26-28                     Fox News      432          RV
#> 3   •     Jun. 18-20                       YouGov      390          LV
#> 4   •     Jun. 15-20              Morning Consult     1733          RV
#> 5   • Jun. 27-Jul. 1                Ipsos, online      142          LV
#> 6   •     Jun. 16-19 Opinion Research Corporation      435          RV
#>   weight      leader Clinton Sanders OMalley
#> 1   1.05  Clinton +2      45      43      NA
#> 2   0.91 Clinton +21      58      37      NA
#> 3   0.79 Clinton +13      55      42      NA
#> 4   0.79 Clinton +18      53      35      NA
#> 5   0.67 Clinton +41      70      29      NA
#> 6   0.66 Clinton +12      55      43      NA
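
As one possible next step for the dates, here is a rough sketch that pulls the first day of each range out of strings like the ones in the date column. The helper name parse_poll_start is made up for illustration, the table itself carries no year so 2016 is an assumption, and the code assumes every value starts with a three-letter English month abbreviation.

```r
# Hypothetical helper: first day of a poll's date range as a Date.
# ASSUMPTIONS: values look like "Jun. 10-13" or "Jun. 27-Jul. 1",
# and all polls fall in the given year (the table has no year column).
parse_poll_start <- function(x, year = 2016L) {
  x <- trimws(x)
  month <- match(substr(x, 1, 3), month.abb)          # "Jun" -> 6
  day <- as.integer(sub("^\\D+(\\d+).*$", "\\1", x))  # first run of digits
  as.Date(sprintf("%d-%02d-%02d", year, month, day))
}

starts <- parse_poll_start(c("Jun. 10-13", "Jun. 27-Jul. 1"))
print(starts)
```

With the data frame above, this could be applied as mutate(polls_df, start_date = parse_poll_start(date)).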

Source: https://stackoverflow.com/questions/50381758/rvest-web-scraping-with-javascript

Tags

javascript

html

css

rvest