Scraping data from tables on multiple web pages in R (football players)

梦谈多话 2020-12-30 13:18

I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format.

1 answer
  • 2020-12-30 13:58

    Here's how you can easily get all the data in all the tables on all the player pages...

    First make a list of the URLs for all the players' pages...

    # library() stops with an error if a package is missing; require() only warns
    library(RCurl); library(XML)
    n <- length(letters) 
    # pre-allocate list to fill
    links <- vector("list", length = n)
    for(i in 1:n){
      print(i) # keep track of what the loop is up to
      # get all html on each page of the a-z index pages
      inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
      # scrape URLs for each player from each index page
      lnk <- unname(xpathSApply(inx_page, "//a/@href"))
      # skip the first 63 and last 11 links as they are constant on each page
      lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
      # only keep links that go to players (exclude schools)
      lnk <- lnk[grep("players", lnk)]
      # now we have a list of all the URLs to all the players on that index page
      # but the URLs are incomplete, so let's complete them so we can use them from 
      # anywhere
      links[[i]] <- paste0("http://www.sports-reference.com", lnk)
    }
    # unlist into a single character vector
    links <- unlist(links)
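
    Dropping a fixed number of links by position stops working as soon as the page layout changes. Here is a minimal sketch of a pattern-based filter instead. It assumes player pages look like /cfb/players/<name>-<number>.html, which is inferred from the URLs used in this answer. keep_player_links is a hypothetical helper name and the sample hrefs are made up for illustration.

    ```r
    # Hypothetical helper: keep only hrefs that look like individual player pages.
    # Assumption: player pages match /cfb/players/<name>-<digits>.html
    keep_player_links <- function(hrefs, base = "http://www.sports-reference.com") {
      is_player <- grepl("^/cfb/players/.+-[0-9]+\\.html$", hrefs)
      # drop duplicates so each player page is requested only once
      paste0(base, unique(hrefs[is_player]))
    }

    # Made-up sample of the kind of hrefs an index page might contain
    sample_hrefs <- c("/cfb/",
                      "/cfb/players/a-index.html",
                      "/cfb/players/neli-aasa-1.html",
                      "/cfb/schools/utah/",
                      "/cfb/players/neli-aasa-1.html")
    player_urls <- keep_player_links(sample_hrefs)
    print(player_urls)
    ```

    Inside the loop this would stand in for the positional drop and the grep() step, e.g. links[[i]] <- keep_player_links(lnk), provided the pattern assumption holds on the live pages.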
    

    Now we have a vector of some 67,000 URLs (seems like a lot of players, can that be right?), so:

    Second, scrape all the tables at each URL to get their data, like so:

    # Go to each URL in the list and scrape all the data from the tables
    # this will take some time... don't interrupt it!
    # start edit1 here - just so you can see what's changed
    # pre-allocate list
    all_tables <- vector("list", length = length(links))
    for(i in seq_along(links)){
      print(i)
      # error handling - skips to next URL if it gets an error
      result <- try(
        all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
      )
      # inherits() is safe even when class() returns more than one value
      if(inherits(result, "try-error")) next
    }
    # end edit1 here
    # Put player names in the list so we know who the data belong to
    # extract names from the URLs to their stats page...
    toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
    # one name per URL (no unique() here), so names stay aligned with all_tables
    player_names <- gsub(paste(toMatch, collapse = "|"), "", links)
    # assign player names to list of tables
    names(all_tables) <- player_names
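
    With tens of thousands of requests, it can help to wrap the fetch step so that failures are logged, names are attached before any data arrives, and there is a pause between requests. The following is only a sketch under those assumptions: scrape_all, its delay argument and fake_fetch are hypothetical names, and the fake fetcher stands in for readHTMLTable so the example runs offline.

    ```r
    # Hypothetical helper: fetch every URL, continuing past failures.
    # `fetch` is any function that takes one URL and returns its tables,
    # e.g. function(u) XML::readHTMLTable(u, stringsAsFactors = FALSE)
    scrape_all <- function(urls, fetch, delay = 0) {
      out <- vector("list", length(urls))
      # name each slot up front so names cannot drift out of step with the data
      names(out) <- sub("-1\\.html$", "", basename(urls))
      for (i in seq_along(urls)) {
        res <- tryCatch(fetch(urls[i]), error = function(e) {
          message("skipping ", urls[i], ": ", conditionMessage(e))
          NULL
        })
        # assigning NULL with [[<- would delete the slot, so only store successes
        if (!is.null(res)) out[[i]] <- res
        Sys.sleep(delay)  # pause between requests to go easy on the server
      }
      out
    }

    # Offline stand-in for the real fetcher: fails for one made-up URL
    fake_fetch <- function(u) {
      if (grepl("broken", u)) stop("HTTP error")
      list(defense = data.frame(Year = 2007, Solo = 2))
    }
    urls <- c("http://www.sports-reference.com/cfb/players/neli-aasa-1.html",
              "http://www.sports-reference.com/cfb/players/broken-page-1.html")
    demo_tables <- scrape_all(urls, fake_fetch)
    print(names(demo_tables))
    ```

    Against the real site the call would look something like scrape_all(links, function(u) readHTMLTable(u, stringsAsFactors = FALSE), delay = 1), where the one-second delay is an arbitrary choice.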
    

    The result looks like this (this is just a snippet of the output):

    all_tables
    $`neli-aasa`
    $`neli-aasa`$defense
       Year School Conf Class Pos Solo Ast Tot Loss  Sk Int Yds Avg TD PD FR Yds TD FF
    1 *2007   Utah  MWC    FR  DL    2   1   3  0.0 0.0   0   0      0  0  0   0  0  0
    2 *2010   Utah  MWC    SR  DL    4   4   8  2.5 1.5   0   0      0  1  0   0  0  0
    
    $`neli-aasa`$kick_ret
       Year School Conf Class Pos Ret Yds  Avg TD Ret Yds Avg TD
    1 *2007   Utah  MWC    FR  DL   0   0       0   0   0      0
    2 *2010   Utah  MWC    SR  DL   2  24 12.0  0   0   0      0
    
    $`neli-aasa`$receiving
       Year School Conf Class Pos Rec Yds  Avg TD Att Yds Avg TD Plays Yds  Avg TD
    1 *2007   Utah  MWC    FR  DL   1  41 41.0  0   0   0      0     1  41 41.0  0
    2 *2010   Utah  MWC    SR  DL   0   0       0   0   0      0     0   0       0
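
    Note that several of these tables repeat column names (Yds and TD each appear twice in the defense table), which makes something like df$Yds ambiguous. One possible fix is base R's make.unique(). The column names here are copied from the defense header above.

    ```r
    # Column names copied from the defense table in the output above
    defense_cols <- c("Year", "School", "Conf", "Class", "Pos", "Solo", "Ast",
                      "Tot", "Loss", "Sk", "Int", "Yds", "Avg", "TD", "PD",
                      "FR", "Yds", "TD", "FF")
    # make.unique() appends .1, .2, ... to repeated names and leaves the rest alone
    unique_cols <- make.unique(defense_cols)
    # show only the names that had to be changed
    print(unique_cols[duplicated(defense_cols)])
    ```

    The same call could be applied to each scraped table with names(d) <- make.unique(names(d)), although that has not been run against the live data.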
    

    Finally, let's say we just want to look at the passing tables...

    # just show passing tables
    passing <- lapply(all_tables, function(i) i$passing)
    # many players have no passing table (NULL); rbind() skips those and stacks the rest
    passing <- do.call(rbind, passing)
    

    And we end up with a data frame that is ready for further analyses (also just a snippet)...

                 Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
    james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
    jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
    jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
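
    In the output above the player names live only in the row names, with .1, .2 suffixes for players who have several seasons. Here is a sketch of an alternative that stores the player as an ordinary column. toy_tables is a small made-up stand-in for all_tables that reuses a few values from the snippets above, and the columns are created as text on the assumption that this is what readHTMLTable returned here.

    ```r
    # Toy stand-in for all_tables: one player without a passing table, two with
    toy_tables <- list(
      "neli-aasa"   = list(defense = data.frame(Year = "2007", Solo = "2")),
      "james-aaron" = list(passing = data.frame(Year = "1978", Cmp = "28")),
      "jeff-aaron"  = list(passing = data.frame(Year = c("2000", "2001"),
                                                Cmp  = c("100", "77")))
    )

    passing_list <- lapply(toy_tables, function(p) p$passing)
    # drop players that have no passing table
    passing_list <- Filter(Negate(is.null), passing_list)
    # add the player name as a column, then stack the tables
    passing_df <- do.call(rbind, Map(function(d, nm) cbind(player = nm, d),
                                     passing_list, names(passing_list)))
    rownames(passing_df) <- NULL
    # convert text columns to numbers before doing arithmetic on them
    passing_df$Cmp <- as.numeric(passing_df$Cmp)
    print(passing_df)
    ```

    Keeping the player in its own column makes later steps such as aggregate() or merge() simpler than parsing names back out of the row names.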
    