Scraping HTML tables into R data frames using the XML package

野的像风 2020-11-22 07:17

How do I scrape HTML tables using the XML package?

Take, for example, this Wikipedia page on the Brazilian soccer team. I would like to read it into R and get the "li

4 answers
  •  南笙 (original poster)
    2020-11-22 07:56

    Another option is to use XPath.

    library(RCurl)
    library(XML)
    
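    # Download the page source as a single string
    theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
    webpage <- getURL(theurl)
    # Split the source into a vector of lines
    webpage <- readLines(tc <- textConnection(webpage)); close(tc)
    
    # Parse into an internal document that supports XPath queries (parser errors are silenced)
    pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)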
    
    # Extract table header and contents
    tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
    results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)
    
    # Convert character vector to dataframe
    content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))
    
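    # Clean up the results: trim surrounding whitespace only, so multi-word
    # headers such as "Goals for" keep their internal space
    content[,1] <- gsub("^\\s+|\\s+$", "", content[,1])
    tablehead <- gsub("^\\s+|\\s+$", "", tablehead)
    names(content) <- tablehead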
    

    This produces the following result:

    > head(content)
       Opponent Played Won Drawn Lost Goals for Goals against % Won
    1 Argentina     94  36    24   34       148           150 38.3%
    2  Paraguay     72  44    17   11       160            61 61.1%
    3   Uruguay     72  33    19   20       127            93 45.8%
    4     Chile     64  45    12    7       147            53 70.3%
    5      Peru     39  27     9    3        83            27 69.2%
    6    Mexico     36  21     6    9        69            34 58.3%
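
Every column in the result above is still text, because `xmlValue` returns strings. A minimal base-R sketch of the reshape and a follow-up type conversion is shown here; the two rows are copied from the output above as a stand-in for the scraped vector, and `count_cols` is a name introduced only for this illustration.

```r
# Stand-in for the flat character vector that xpathSApply() returns:
# cell values arrive one after another, row by row.
results <- c("Argentina", "94", "36", "24", "34", "148", "150", "38.3%",
             "Paraguay",  "72", "44", "17", "11", "160", "61",  "61.1%")
tablehead <- c("Opponent", "Played", "Won", "Drawn", "Lost",
               "Goals for", "Goals against", "% Won")

# Guard against a mis-wrapped table: matrix() recycles values (with only a
# warning) when the vector length is not a multiple of the column count.
stopifnot(length(results) %% length(tablehead) == 0)

# Derive the column count from the header instead of hard-coding it
content <- as.data.frame(matrix(results, ncol = length(tablehead), byrow = TRUE),
                         stringsAsFactors = FALSE)
names(content) <- tablehead

# Convert the count columns from character to integer
count_cols <- c("Played", "Won", "Drawn", "Lost", "Goals for", "Goals against")
content[count_cols] <- lapply(content[count_cols], as.integer)

# Strip the percent sign and store the share as a number
content[["% Won"]] <- as.numeric(sub("%", "", content[["% Won"]], fixed = TRUE))

str(content)
```

Taking `ncol` from `length(tablehead)` is a design choice for this sketch: it keeps the reshape in step with the header if the page gains or loses a column.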
    
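
The XPath in the answer expects each `tr` to be a direct child of `table`. If the HTML that is served wraps the rows in a `tbody` element, `/tr/th` would match nothing. The sketch below runs the same extraction offline against a small made-up HTML string (the `html` value is hypothetical, not the real page) and uses `//tr` so that either layout matches.

```r
library(XML)

# Hypothetical two-row page standing in for the real Wikipedia HTML
html <- '<html><body>
<table class="wikitable sortable">
<tbody>
<tr><th>Opponent</th><th>Played</th><th>Won</th></tr>
<tr><td>Argentina</td><td>94</td><td>36</td></tr>
<tr><td>Paraguay</td><td>72</td><td>44</td></tr>
</tbody>
</table>
</body></html>'

# Parse the string directly; asText = TRUE tells the parser it is not a file name
doc <- htmlParse(html, asText = TRUE)

# "//tr" matches rows whether or not they sit inside a <tbody>
tbl <- "//table[@class='wikitable sortable']"
tablehead <- xpathSApply(doc, paste0(tbl, "//tr/th"), xmlValue)
results   <- xpathSApply(doc, paste0(tbl, "//tr/td"), xmlValue)

# Rebuild the table row by row, one column per header cell
content <- as.data.frame(matrix(results, ncol = length(tablehead), byrow = TRUE),
                         stringsAsFactors = FALSE)
names(content) <- tablehead
print(content)
```

Testing the XPath on a small inline string like this makes it easier to check the selectors before pointing them at a live page whose markup may change.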
