Trying to use rvest to loop a command to scrape tables from multiple pages

Question

I'm trying to scrape HTML tables for different football teams. The page below has the table I want. I want to scrape that same table for every team and ultimately create a single CSV file that has the player names and their data.

http://www.pro-football-reference.com/teams/tam/2016_draft.htm

# teams
teams <- c("ATL", "TAM", "NOR", "CAR", "GNB", "DET", "MIN", "CHI", "SEA", "CRD", "RAM", "NWE", "MIA", "BUF", "NYJ", "KAN", "RAI", "DEN", "SDG", "PIT", "RAV", "SFO", "CIN", "CLE", "HTX", "OTI", "CLT", "JAX", "DAL", "NYG", "WAS", "PHI")

# loop
for (i in teams) {
  url <- paste0("http://www.pro-football-reference.com/teams/", i, "/2016-snap-counts.htm#snap_counts::none", sep = "")
  webpage <- read_html(url)

  # grab table
  sb_table <- html_nodes(webpage, 'table')
  html_table(sb_table)
  head(sb_table)

  # bind to dataframe
  df <- rbind(df, sb_table)
}

I'm getting an error, though, saying that I should use either CSS or XPath and not both, but I can't figure out exactly where the problem is (I suspect the html_nodes command). Can anyone help me fix this problem?

Answer 1:

I think your URLs are built incorrectly and, in addition, the team names in the URL are case-sensitive (the site uses lowercase). Could you try something like this instead?

library(rvest)
library(magrittr)

# teams
teams <- c("ATL", "TAM", "NOR", "CAR", "GNB", "DET", "MIN", "CHI", "SEA", "CRD", "RAM", "NWE", "MIA", "BUF", "NYJ", "KAN", "RAI", "DEN", "SDG", "PIT", "RAV", "SFO", "CIN", "CLE", "HTX", "OTI", "CLT", "JAX", "DAL", "NYG", "WAS", "PHI")

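tables <- list()
index <- 1
for (i in teams) {
  # try() lets the loop continue if one page fails to load or parse
  try({
    url <- paste0("http://www.pro-football-reference.com/teams/", tolower(i), "/2016_draft.htm")
    # html_table() on a whole document returns a list with one data frame per <table>
    page_tables <- url %>%
      read_html() %>%
      html_table(fill = TRUE)

    # keep the first table on the page; [[ ]] stores the data frame itself
    tables[[index]] <- page_tables[[1]]

    index <- index + 1
  })
}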

df <- do.call("rbind", tables)
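
Since the goal in the question is a single CSV, it may help to record which team each row came from before binding. The following is only a sketch under stated assumptions: the `tables` list here is a hypothetical stand-in named by team code (the loop above stores tables by position instead), the player rows are made up, and the output path is a placeholder.

```r
# Hypothetical stand-ins for the per-team tables collected in the loop,
# stored in a list named by team code (made-up rows, not real data).
tables <- list(
  TAM = data.frame(Player = c("Player A", "Player B"), Rnd = c(1, 2)),
  ATL = data.frame(Player = "Player C", Rnd = 1)
)

# Add a Team column to each table so rows stay identifiable after binding.
tagged <- Map(function(tbl, team) cbind(Team = team, tbl), tables, names(tables))

# Stack everything into one data frame and write it out.
df <- do.call(rbind, tagged)
rownames(df) <- NULL
out_file <- file.path(tempdir(), "draft_2016.csv")  # hypothetical output path
write.csv(df, out_file, row.names = FALSE)
```

Note that rbind() expects every table to have the same column names, so this only works as written if each team page yields the same layout.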

PS: I do not understand why this question is downvoted. It seems well formulated ...

Answer 2:

I think the appropriate CSS selector in this case is #snap_counts. Also if there is one table per page, you can use html_node() (singular, not nodes):

url %>% 
  read_html() %>% 
  html_node("#snap_counts") %>% 
  html_table(header = FALSE)

Since the table has two header rows and some header cells span columns, it's probably best to use header = FALSE. The first 2 rows of the data frame will contain the headers and you can clean up manually (create your own column names).
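
The manual cleanup could look roughly like the sketch below. It does not touch the website: `raw` is a hypothetical stand-in for what html_table(header = FALSE) might return, assuming spanned header cells are repeated across the columns they cover, and the column labels and values are made up for illustration.

```r
# Hypothetical stand-in for the result of html_table(header = FALSE):
# rows 1-2 hold the two header rows, the remaining rows hold the data.
raw <- data.frame(
  X1 = c("", "Player", "Player A", "Player B"),
  X2 = c("", "Pos", "QB", "WR"),
  X3 = c("Off.", "Num", "100", "80"),
  X4 = c("Off.", "Pct", "95%", "76%"),
  stringsAsFactors = FALSE
)

# Build one name per column by joining the two header rows,
# skipping the top-level label when it is empty.
top <- unlist(raw[1, ], use.names = FALSE)
sub <- unlist(raw[2, ], use.names = FALSE)
col_names <- ifelse(top == "", sub, paste(top, sub, sep = "_"))

# Drop the header rows and apply the new names.
clean <- raw[-(1:2), ]
names(clean) <- make.names(col_names, unique = TRUE)
rownames(clean) <- NULL

# Convert columns that are purely numeric text to numbers.
clean[] <- lapply(clean, type.convert, as.is = TRUE)
```

Joining the two header rows keeps the group label (here "Off.") in the column name, which avoids duplicate names when several groups share the same sub-headers.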

Source: https://stackoverflow.com/questions/42356491/trying-to-use-rvest-to-loop-a-command-to-scrape-tables-from-multiple-pages

Tags

web-scraping

rvest