I am trying to make a data frame by scraping data from the web, but the table I want to scrape is split across multiple pages: the link is the same, only the page number changes.
You can build the URL dynamically with paste0, since the URLs differ only slightly: for a given year, you change just the page number. You get a URL structure like:
url <- paste0(url1, year, url2, page, url3)  ## change page, year, or both
You can write a function that takes a page number and returns the table for that page, then bind all the tables together with the classic do.call(rbind, ...):
library(XML)

## fixed pieces of the URL; the year and the page number get spliced in between
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
## fetch one page of stats for a given year
getTable <- function(page = 1, year = 2013) {
  url <- paste0(url1, year, url2, page, url3)
  ## header = FALSE: do not take column names from the page's header row;
  ## stringsAsFactors = FALSE keeps the columns as character vectors
  tab <- readHTMLTable(url, header = FALSE, stringsAsFactors = FALSE)
  tab$result  ## on this site the stats table is the list element named "result"
}
## merge the tables from all 8 pages into one big table
do.call(rbind, lapply(seq_len(8), getTable, year = 2013))
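If you need several seasons, you can reuse getTable the same way and loop over years as well. A small sketch, assuming each season also has 8 pages (adjust as needed); getYear and all_years are just illustrative names:

getYear <- function(year) {
  tab <- do.call(rbind, lapply(seq_len(8), getTable, year = year))
  cbind(year = year, tab)  ## tag each row with the season it came from
}
all_years <- do.call(rbind, lapply(2011:2013, getYear))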
The more general method is to scrape the next-page URL from the page itself using an XPath expression, and loop until there is no next page left. This can be more difficult to implement, but it is the cleanest solution:
## return the absolute URL of the next page, or '' when there is none
getNext <- function(url = url_base) {
  doc <- htmlParse(url)
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc, XPATH_NEXT, xmlGetAttr, 'href'))
  if (length(next_page) > 0)
    paste0("http://www.nfl.com", next_page)
  else ''
}
## url_base is your first url
res <- NULL  ## rbind(NULL, df) just returns df, so NULL is a safe accumulator
while (TRUE) {
  tab <- readHTMLTable(url_base, header = FALSE)
  res <- rbind(res, tab$result)  ## append this page's rows
  url_base <- getNext(url_base)  ## follow the "next" link
  if (nchar(url_base) == 0)      ## no next page left: stop
    break
}
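If you want to be more defensive, you can wrap the same loop in a function with a hard cap on the number of pages and a small delay between requests, so a misbehaving "next" link cannot loop forever. A sketch under those assumptions; scrapeAll and max_pages are illustrative names, not part of the original answer:

scrapeAll <- function(url, max_pages = 50) {
  res <- NULL
  for (i in seq_len(max_pages)) {  ## hard cap in case the "next" link misbehaves
    tab <- readHTMLTable(url, header = FALSE)
    res <- rbind(res, tab$result)
    url <- getNext(url)
    if (nchar(url) == 0) break     ## no next page left: we are done
    Sys.sleep(1)                   ## be polite to the server between requests
  }
  res
}

res <- scrapeAll(url_base)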