Web Scraping Basketball Reference using R

Question

I'm interested in extracting the player tables on basketball-reference.com. I have successfully extracted the per game statistics table for a specific player (LeBron James, for example), which is the first table listed on the web page. However, there are 10+ other tables on the page that I can't seem to extract. I've been able to get the first table into R a couple of different ways. First, using the rvest package:

library(rvest)
lebron <- "https://www.basketball-reference.com/players/j/jamesle01.html"
lebron_webpage <- read_html(lebron)
lebron_table <- html_table(lebron_webpage, fill = TRUE)
lebron_pergame <- data.frame(lebron_table)

Now I have LeBron's career per game statistics in a nice data frame. I'm also able to read the same table in using a combination of the XML and RCurl packages.

library(RCurl)
library(XML)
# Download the raw HTML as a character string
lebron_html <- getURL(lebron)
# which = 1 returns only the first table on the page
lebron_table <- readHTMLTable(lebron_html, which = 1)

The problem comes when I want to read in another table on the page. For example, the next table on the page is Totals. I've tried using a CSS selector to select the specific table I want to read in, but I can't get that to work. I've also tried right-clicking the table, choosing Inspect Element, and copying the table's XPath, but I can't get that to work either. I've spent a lot of time researching this issue on Google, but can't seem to find anything that solves this problem. Any help would be greatly appreciated! Thanks in advance!

Answer 1:

The tables after the first one are loaded dynamically (via JavaScript), so a plain read_html() call does not see them. There are several ways to extract them.

Using RSelenium to simulate user navigation:

library(rvest)
library(RSelenium)
lebron <- "https://www.basketball-reference.com/players/j/jamesle01.html"
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate(lebron)
lebron_webpage <- read_html(remDr$getPageSource()[[1]])
lebron_table <- html_table(lebron_webpage, fill = TRUE)

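# Create one data frame per table: table_1, table_2, ...
for (i in seq_along(lebron_table)) {
  assign(paste0("table_", i), data.frame(lebron_table[[i]]))
}
# You can rename each table after its title to be more explicit

# Close the browser and stop the Selenium server when done
remDr$close()
rD$server$stop()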
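As a rough sketch of the renaming idea, the list returned by html_table() can be named after each table's id attribute instead of creating numbered variables. The HTML string below is a made-up stand-in for the rendered page source: the ids per_game and totals and all values are assumptions for illustration, so check the real ids in the page source. It also assumes rvest 1.0.0 or later for html_elements().

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource()[[1]]; ids and values are invented
page_source <- '
<table id="per_game">
  <tr><th>Season</th><th>PTS</th></tr>
  <tr><td>S1</td><td>1.5</td></tr>
</table>
<table id="totals">
  <tr><th>Season</th><th>PTS</th></tr>
  <tr><td>S1</td><td>100</td></tr>
</table>'

page <- read_html(page_source)

# Select every <table> node, parse them all, and name each result by its id
nodes <- html_elements(page, "table")
tables <- html_table(nodes)
names(tables) <- html_attr(nodes, "id")

# Tables can now be looked up by name rather than by position
totals <- tables[["totals"]]
```

With a named list, tables[["totals"]] keeps working even if the order of tables on the page changes, which is not true of table_1, table_2, and so on.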

Another way is to inspect the network requests made by the page's JavaScript and see whether the results are available as JSON.

Hope that helps.
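A related browser-free check, which is not part of the original answer, is to look for the extra tables in the static HTML itself. Some pages ship such tables inside HTML comment nodes and let JavaScript reveal them later. Whether this page does so is an assumption to verify by viewing the page source. The sketch below runs on a made-up HTML string (ids and values are invented) and shows how commented-out tables could be recovered with xml2 and rvest.

```r
library(rvest)
library(xml2)

# Hypothetical page: one visible table, one table hidden inside an HTML comment
html <- '
<div>
  <table id="per_game">
    <tr><th>Season</th><th>PTS</th></tr>
    <tr><td>S1</td><td>1.5</td></tr>
  </table>
  <!--
  <table id="totals">
    <tr><th>Season</th><th>PTS</th></tr>
    <tr><td>S1</td><td>100</td></tr>
  </table>
  -->
</div>'

page <- read_html(html)

# Collect the text of every comment node and re-parse it as HTML
comment_nodes <- xml_find_all(page, "//comment()")
hidden_html <- paste(xml_text(comment_nodes), collapse = "")
hidden_page <- read_html(hidden_html)

# Parse the tables that were hidden in the comments, named by id
hidden_nodes <- html_elements(hidden_page, "table")
hidden_tables <- html_table(hidden_nodes)
names(hidden_tables) <- html_attr(hidden_nodes, "id")
```

If the comment nodes on the real page contain no tables, hidden_tables comes back empty and a rendered-page approach such as RSelenium is still needed.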

Gottavianoni

Source: https://stackoverflow.com/questions/48778493/web-scraping-basketball-reference-using-r

Tags

xml

web-scraping

rvest

rcurl