Scrape an HTML table spanning multiple pages using R

温柔的废话 2021-02-10 20:04

I am trying to build a data frame by scraping a table from the web, but the table is split across multiple pages. The link is the same for every page; only the page number differs.


1 answer
  • 2021-02-10 20:50

    You can build the URL dynamically with paste0, since the URLs differ only slightly. For a given year, only the page number changes. The URL structure looks like this:

    url <- paste0(url1,year,url2,page,url3) ## you change page or year or both
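
    Since paste0 is vectorised, passing a vector of page numbers gives one URL per page in a single call. A minimal sketch, assuming made-up URL pieces (example.com, `prefix` and `middle` are placeholders, not the real site):

    ```r
    # Hypothetical URL pieces, for illustration only (not the real site).
    prefix <- "http://example.com/stats?season="
    middle <- "&page="

    # paste0 recycles the scalar pieces across the page vector,
    # so the result has one element per page.
    pages <- 1:3
    urls <- paste0(prefix, 2013, middle, pages)
    print(urls)
    ```

    The same idea should carry over to the three-piece URL used here: keep the fixed pieces as they are and pass a vector for page.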
    

    You can write a function that returns the table for a given page, apply it over the page numbers, and then bind the results using the classic do.call(rbind, ...):

    library(XML)
    ## fixed pieces of the URL; only year and page vary
    url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
    year <- 2013
    url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
    page <- 1
    url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
    
    ## fetch the stats table for one page of one season
    getTable <- 
      function(page=1,year=2013){
        url <- paste0(url1,year,url2,page,url3)
        ## readHTMLTable returns a list with every table found on the page
        tab <- readHTMLTable(url,header=FALSE)
        tab$result  ## keep only the element named "result"
    }
    ## this will merge all tables (pages 1 to 8) in a single big table
    do.call(rbind,lapply(seq_len(8),getTable,year=2013))
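
    If you need to vary both page and year, the same do.call(rbind, ...) pattern works over every combination. This is only a sketch: `fake_table` is a hypothetical stand-in for getTable that returns a tiny data frame instead of downloading anything, so it runs offline.

    ```r
    # Hypothetical stand-in for getTable(): builds a small data frame
    # instead of fetching a page, so the sketch runs without network access.
    fake_table <- function(page = 1, year = 2013) {
      data.frame(year = year,
                 page = page,
                 player = paste0("player_", page, "_", 1:2),
                 stringsAsFactors = FALSE)
    }

    # Every (page, year) combination to fetch.
    grid <- expand.grid(page = 1:3, year = c(2012, 2013))

    # Map calls the function once per row of the grid;
    # do.call(rbind, ...) stacks the resulting data frames.
    tables <- Map(fake_table, page = grid$page, year = grid$year)
    all_rows <- do.call(rbind, tables)
    print(dim(all_rows))
    ```

    With the real getTable in place of fake_table, each call would download one page, so the number of pages per year has to be known in advance for this approach.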
    

    The general method

    The general method is to scrape the URL of the next page with an XPath expression and loop until there is no next page. This can be harder to get right, but it is the cleanest solution.

    ## return the URL of the next page, or '' if there is none
    getNext <- 
    function(url=url_base){
      doc <- htmlParse(url)
      ## the "next" link inside the navigation bar
      XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
      next_page <- unique(xpathSApply(doc,XPATH_NEXT,xmlGetAttr,'href'))
      if(length(next_page)>0)
        paste0("http://www.nfl.com",next_page)
      else ''
    }
    ## url_base is your first url
    res <- list()
    while(TRUE){
      tab <- readHTMLTable(url_base,header=FALSE)
      res[[length(res)+1]] <- tab$result  ## keep this page's table
      url_base <- getNext(url_base)
      if (nchar(url_base)==0)
        break
    }
    ## bind all pages into a single table
    res <- do.call(rbind,res)
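
    To see the control flow of the loop without hitting the site, here is a minimal offline sketch. Everything in it is an assumption made for illustration: `pages` is an in-memory stand-in for the real HTML pages, and `crawl`, `max_pages` and the `seen` check are extra guards that are not part of the answer's code.

    ```r
    # Hypothetical in-memory "site": each page holds a table and the key of
    # the next page ('' on the last page), standing in for real HTML pages.
    pages <- list(
      p1 = list(table = data.frame(id = 1:2), nxt = "p2"),
      p2 = list(table = data.frame(id = 3:4), nxt = "p3"),
      p3 = list(table = data.frame(id = 5:6), nxt = "")
    )

    # Follow the "next" pointers until there is none, collecting one table
    # per page. The seen/max_pages checks stop the loop if links form a cycle.
    crawl <- function(start, max_pages = 100) {
      res <- list()
      seen <- character(0)
      current <- start
      while (nzchar(current) && !(current %in% seen) && length(res) < max_pages) {
        seen <- c(seen, current)
        res[[length(res) + 1]] <- pages[[current]]$table
        current <- pages[[current]]$nxt
      }
      do.call(rbind, res)
    }

    all_rows <- crawl("p1")
    print(all_rows)
    ```

    Collecting the tables in a list and binding once at the end avoids copying the growing table on every iteration, which matters when there are many pages.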
    