Scrape a table with rvest in R that has mismatched table headings

Submitted by 百般思念 on 2021-02-11 18:24:35

Question


I'm trying to scrape this table, which seems like it should be very simple. Here's the URL of the table: https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1

Here's what I coded:

library(rvest)

url <- "https://fantasy.nfl.com/research/scoringleaders?position=1&sort=pts&statCategory=stats&statSeason=2019&statType=weekStats&statWeek=1"
# html_table() returns a list with one data frame per matched <table>
x <- data.frame(read_html(url) %>% 
  html_nodes("table") %>% 
  html_table())

This works OK, but it gives a strange two-row header, and when I try to add %>% slice(-1) to take out the top row, I get an error saying I can't because the object is a list. I'd really like to figure out how to do this.


Answer 1:


Here's one solution. An explanation follows.

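library(rvest)
library(tidyverse)

# `url` is the URL defined in the question
read_html(url) %>% 
  html_nodes("table") %>%  
  html_table(header = TRUE) %>%
  simplify() %>% 
  first() %>%   # pull the data frame out of the list
  setNames(paste0(colnames(.), as.character(.[1,]))) %>%   # merge both header rows
  slice(-1)     # drop the now-redundant first row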

Output of glimpse():

Observations: 25
Variables: 16
$ Rank          <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"…
$ Player        <chr> "Lamar Jackson QB - BAL", "Dak Prescott QB - DAL", "Deshaun W…
$ Opp           <chr> "@MIA", "NYG", "@NO", "@ARI", "@JAX", "@PHI", "PIT", "WAS", "…
$ PassingYds    <chr> "324", "405", "268", "385", "378", "380", "341", "313", "248"…
$ PassingTD     <chr> "5", "4", "3", "3", "3", "3", "3", "3", "3", "3", "2", "2", "…
$ PassingInt    <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "1", "1", "1", "…
$ RushingYds    <chr> "6", "12", "40", "22", "2", "-", "-", "5", "24", "6", "13", "…
$ RushingTD     <chr> "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingRec  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingYds  <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ ReceivingTD   <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ RetTD         <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ MiscFumTD     <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ Misc2PT       <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "1", "-", "…
$ FumLost       <chr> "-", "-", "-", "1", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ FantasyPoints <chr> "33.56", "33.40", "30.72", "27.60", "27.32", "27.20", "25.64"…

Explanation
From ?html_table docs:

html_table currently makes a few assumptions:

  • No cells span multiple rows
  • Headers are in the first row

Part of your problem is solved by setting header = TRUE in html_table().

Another part of the problem is that the header cells span two rows, which html_table() does not expect.

Assuming you don't want to lose the information in either header row, you can:

  1. Use simplify and first to pull out the data frame from the list you get from html_table
  2. Use setNames to merge the two header rows (which are now the data frame columns and the first row)
  3. Remove the first row (now redundant) with slice
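
The three steps above can be sketched on a small inline table instead of the live NFL page. The HTML string, the player names, and the helper name merge_header_rows are made up for illustration. Only read_html(), html_nodes(), and html_table() come from rvest, and the rest is base R. Header handling may differ between rvest versions, so treat this as a sketch under those assumptions, not a definitive implementation.

```r
library(rvest)

# Hypothetical stand-in for the real page: a table with a two-row header.
html <- '<table>
  <tr><th></th><th colspan="2">Passing</th><th>Rushing</th></tr>
  <tr><th>Player</th><th>Yds</th><th>TD</th><th>Yds</th></tr>
  <tr><td>Player A</td><td>324</td><td>5</td><td>6</td></tr>
  <tr><td>Player B</td><td>405</td><td>4</td><td>12</td></tr>
</table>'

# Hypothetical helper: merge the column names with the first data row,
# then drop that row.
merge_header_rows <- function(df) {
  df <- as.data.frame(df)
  top <- names(df)
  top[is.na(top)] <- ""   # treat a missing top-level header as empty
  # First value of every column, i.e. the second header row
  bottom <- vapply(df, function(col) as.character(col[1]), character(1))
  names(df) <- paste0(top, bottom)
  df <- df[-1, , drop = FALSE]
  rownames(df) <- NULL
  df
}

# html_table() on a node set returns a list, one element per <table>
tables <- html_table(html_nodes(read_html(html), "table"), header = TRUE)

# Step 1: pull the data frame out of the list; steps 2 and 3: merge and drop
result <- merge_header_rows(tables[[1]])
print(result)
```

On this toy input the merged column names should come out as Player, PassingYds, PassingTD, and RushingYds. Indexing with tables[[1]] plays the same role as simplify() %>% first() in the answer, and it is also why slice() cannot be applied directly to the result of html_table().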


Source: https://stackoverflow.com/questions/60235341/scrape-a-table-with-rvest-in-r-that-has-mismatch-table-heading
