Web Scraping Basketball Reference using R

Posted by 半世苍凉 on 2019-12-10 22:48:39

Question


I'm interested in extracting the player tables on basketball-reference.com. I have successfully extracted the per game statistics table for a specific player (LeBron James, for example), which is the first table listed on the web page. However, there are 10+ other tables on the page that I can't seem to extract. I've been able to get the first table into R in a couple of different ways. First, using the rvest package:

library(rvest)
lebron <- "https://www.basketball-reference.com/players/j/jamesle01.html"
lebron_webpage <- read_html(lebron)
lebron_table <- html_table(lebron_webpage, fill = TRUE)
lebron_pergame <- data.frame(lebron_table)

Now I have LeBron's career per game statistics in a nice data frame. I'm also able to read the same table in using a combination of the XML and RCurl packages.

library(RCurl)
library(XML)
lebron_url <- paste0(lebron)
lebron_url <- getURL(lebron_url)
lebron_table <- readHTMLTable(lebron_url, which = 1)

The problem comes when I want to read in another table on the page. For example, the next table on the page is Totals. I've tried using a CSS selector to select the specific table I want to read in, but I can't get that to work. I've also tried right-clicking the table, choosing Inspect Element, and copying its XPath, but I can't get that to work either. I've spent a lot of time researching this issue on Google, but can't seem to find anything that solves the problem. Any help would be greatly appreciated! Thanks in advance!


Answer 1:


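The remaining tables are loaded dynamically with JavaScript, so html_table() does not see them in the page that read_html() downloads. There are several ways to extract them.

One option is to use RSelenium to drive a real browser and read the rendered page: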

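library(rvest)
library(RSelenium)

lebron <- "https://www.basketball-reference.com/players/j/jamesle01.html"

# Start a Selenium server and browser session, then load the page
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate(lebron)

# Parse the rendered page source (after JavaScript has run) and extract every table
lebron_webpage <- read_html(remDr$getPageSource()[[1]])
lebron_table <- html_table(lebron_webpage, fill = TRUE)

# Create table_1, table_2, ... in the workspace, one data frame per table
for (i in seq_along(lebron_table)) {
  assign(paste0("table_", i), data.frame(lebron_table[[i]]))
}
# You can rename each table with a more descriptive title

# Shut down the browser and the Selenium server when finished
remDr$close()
rD[["server"]]$stop()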
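As a possible alternative to numbering the tables with assign(), the tables can be kept in one list named by each table's id attribute, which makes it easier to pick out a specific one such as Totals. The sketch below is only an illustration: it parses a small inline HTML string as a stand-in for the rendered page source, and the ids ("per_game", "totals") and the values in the cells are made-up placeholders, not data from the site.

```r
library(rvest)

# Stand-in for read_html(remDr$getPageSource()[[1]]).
# The table ids and cell values here are placeholders for illustration.
page <- read_html('
<table id="per_game">
  <tr><th>Season</th><th>PTS</th></tr>
  <tr><td>A</td><td>20.5</td></tr>
</table>
<table id="totals">
  <tr><th>Season</th><th>PTS</th></tr>
  <tr><td>A</td><td>200</td></tr>
</table>')

# Select every <table> node, convert each to a data frame,
# and name the list elements after the tables' id attributes
table_nodes <- html_nodes(page, "table")
tables <- html_table(table_nodes)
names(tables) <- html_attr(table_nodes, "id")

# Look up a table by name instead of by position
tables[["totals"]]
```

With the real page, the same three lines after the parsing step should apply as long as the tables carry id attributes; it is worth checking names(tables) first to see which ids are actually present.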

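Another way is to inspect the requests that the page's JavaScript makes (for example, in the Network tab of the browser's developer tools) and see whether the data is available as JSON.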
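If such a JSON response does turn up, the jsonlite package can usually turn it into a data frame directly. The following is a minimal sketch under that assumption: the JSON text, its field names (season, games, points), and its values are hypothetical placeholders, since the original answer does not identify any actual endpoint or response format for this site.

```r
library(jsonlite)

# Hypothetical response body. In practice this text would come from the
# request found in the browser's developer tools, not from this string.
json_body <- '[
  {"season": "A", "games": 10, "points": 200},
  {"season": "B", "games": 12, "points": 260}
]'

# A JSON array of flat objects is simplified into a data frame,
# with one row per object and one column per field
totals <- fromJSON(json_body)
str(totals)
```

fromJSON() also accepts a URL in place of the string, so once a real request is identified its address could be passed in the same way.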

Hope that helps.

Gottavianoni



Source: https://stackoverflow.com/questions/48778493/web-scraping-basketball-reference-using-r
