Scrape HTML table with multiple pages using R


I am trying to build a data frame by scraping a table from the web, but the table is split across multiple pages. The link is the same for every page; only the page number differs.


1 answer

天涯浪人

2021-02-10 20:50

You can build the URL dynamically with paste0, since the page URLs differ only slightly. For a given year you change just the page number. The URL structure looks like this:

url <- paste0(url1,year,url2,page,url3) ## you change page or year or both
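Because paste0 is vectorized, passing a vector of page numbers should yield every page URL in one call. A minimal sketch, reusing the URL fragments from the NFL example in this answer; the name page_urls is only illustrative:

```r
# URL fragments copied from the NFL example in this answer
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

# paste0 recycles the scalar pieces across the page vector and
# returns one URL per page (page_urls is an illustrative name)
page_urls <- paste0(url1, 2013, url2, 1:3, url3)
length(page_urls)
```

Only the value after d-447263-p= changes from one element to the next.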

You can write a function that returns the table for a given page, call it for each page, and then bind the results using the classic do.call(rbind, ...):

library(XML)
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
year <- 2013
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
page <- 1
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"

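## return the table found on one results page
getTable <- 
  function(page=1,year=2013){
    url <- paste0(url1,year,url2,page,url3)
    ## header=FALSE: treat the table as having no column labels
    tab <- readHTMLTable(url,header=FALSE)
    tab$result  ## the list element named "result"
}
## this will merge the tables from pages 1 to 8 into a single big table
do.call(rbind,lapply(seq_len(8),getTable,year=2013))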
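The do.call(rbind, lapply(...)) pattern can be tried offline. A minimal sketch in which fake_page is a hypothetical stand-in for getTable that returns a toy data frame instead of downloading anything; the extra page column is an assumption added here to show which page each row came from:

```r
# fake_page is a hypothetical stand-in for getTable(): it builds a
# small data frame locally instead of reading a web page
fake_page <- function(page) {
  data.frame(player = paste0("player_", page, "_", 1:2),
             page   = page,
             stringsAsFactors = FALSE)
}

# one data frame per page, then stack them row-wise
pages <- lapply(seq_len(3), fake_page)
all_rows <- do.call(rbind, pages)
nrow(all_rows)
```

rbind needs every page to have the same columns, which is normally the case for pages of one paginated table.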

The general method

The general method is to scrape the next-page URL with an XPath expression and loop until there is no next page. This can be more difficult to do, but it is the cleanest solution.

## return the URL of the next page, or '' if there is none
getNext <- 
function(url=url_base){
  doc <- htmlParse(url)
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc,XPATH_NEXT,xmlGetAttr,'href'))
  if(length(next_page)>0)
    paste0("http://www.nfl.com",next_page)
  else ''
}
## url_base is your first url
res <- NULL
while(TRUE){
  tab <- readHTMLTable(url_base,header=FALSE)
  res <- rbind(res,tab$result)   ## append this page's rows
  url_base <- getNext(url_base)  ## '' when there is no next page
  if (nchar(url_base)==0)
    break
}
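The follow-the-next-link loop can also be exercised without network access. A minimal sketch in which a named list, site, plays the role of the website; site, get_next and the next_url field are hypothetical and stand in for htmlParse plus the XPath lookup above:

```r
# A hypothetical in-memory "website": each page holds a table and the
# address of the next page (NULL on the last page)
site <- list(
  p1 = list(table = data.frame(x = 1:2), next_url = "p2"),
  p2 = list(table = data.frame(x = 3:4), next_url = "p3"),
  p3 = list(table = data.frame(x = 5:6), next_url = NULL)
)

# Stand-in for getNext(): returns "" when there is no next page
get_next <- function(url) {
  nxt <- site[[url]]$next_url
  if (is.null(nxt)) "" else nxt
}

tables <- list()
url <- "p1"
while (nzchar(url)) {
  tables[[length(tables) + 1]] <- site[[url]]$table  # collect this page
  url <- get_next(url)                               # move to the next one
}
result <- do.call(rbind, tables)
```

Collecting the pages in a list and binding once at the end avoids copying the growing data frame on every iteration.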
