Scrape a table with rvest in R that has mismatched table headings

百般思念 submitted on 2021-02-11 18:24:35


I'm trying to scrape this table, which seems like it should be super simple. Here's the URL of the table:

Here's what I coded:

url <- ""
x = data.frame(read_html(url) %>% 
  html_nodes("table") %>% 
  html_table())

This works OK but gives really weird two-row headers, and when I try to add %>% slice(-1) to take out the top row, it says I can't because it's a list. I'd really like to figure out how to do this.


Here's one solution. An explanation follows.


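library(rvest)
library(dplyr)   # first(), slice(), glimpse()
library(purrr)   # simplify()

read_html(url) %>% 
  html_nodes("table") %>%  
  html_table(header = TRUE) %>%
  simplify() %>% 
  first() %>% 
  setNames(paste0(colnames(.), as.character(.[1,]))) %>%
  slice(-1) %>% 
  glimpse()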

Output of glimpse():

Observations: 25
Variables: 16
$ Rank          <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"…
$ Player        <chr> "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL", "Deshaun W…
$ Opp           <chr> "@MIA", "NYG", "@NO", "@ARI", "@JAX", "@PHI", "PIT", "WAS", "…
$ PassingYds    <chr> "324", "405", "268", "385", "378", "380", "341", "313", "248"…
$ PassingTD     <chr> "5", "4", "3", "3", "3", "3", "3", "3", "3", "3", "2", "2", "…
$ PassingInt    <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "1", "1", "1", "…
$ RushingYds    <chr> "6", "12", "40", "22", "2", "-", "-", "5", "24", "6", "13", "…
$ RushingTD     <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingRec  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingYds  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingTD   <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ RetTD         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ MiscFumTD     <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ Misc2PT       <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "1", "-", "…
$ FumLost       <chr> "-", "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ FantasyPoints <chr> "33.56", "33.40", "30.72", "27.60", "27.32", "27.20", "25.64"…

From ?html_table docs:

html_table currently makes a few assumptions:

  • No cells span multiple rows
  • Headers are in the first row

Part of your problem is solved by setting header = TRUE in html_table().

Another part of the problem is that the header spans two rows, which html_table() does not expect, so the second header row ends up as the first row of data.

Assuming you don't want to lose the information in either header row, you can:

  1. Use simplify and first to pull out the data frame from the list you get from html_table
  2. Use setNames to merge the two header rows (which are now the data frame columns and the first row)
  3. Remove the first row (now redundant) with slice
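
The header-merging steps (2 and 3) can be tried without a network connection. The following is a minimal sketch in base R only; the object names raw, merged_names, and clean are made up for illustration, and raw is a hand-built stand-in that is only assumed to resemble what html_table(header = TRUE) returns for a two-row header. The values are taken from the first two rows of the output above.

```r
# Hypothetical stand-in for the scraped table: the top header row has become
# the column names, and the second header row is stuck in the first data row.
raw <- data.frame(
  c("Rank", "1", "2"),
  c("Player", "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL"),
  c("Yds", "324", "405"),
  c("TD", "5", "4"),
  stringsAsFactors = FALSE
)
names(raw) <- c("", "", "Passing", "Passing")

# Step 2: merge the two header rows by pasting each column name onto the
# value found in the first row of that column.
merged_names <- paste0(names(raw), unlist(raw[1, ], use.names = FALSE))

# Step 3: apply the merged names and drop the now-redundant first row.
clean <- setNames(raw[-1, ], merged_names)
rownames(clean) <- NULL

print(clean)
```

In the pipeline above, setNames() and slice(-1) play the same roles as the last two statements here; the base R form just makes the intermediate values easier to inspect.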

