How to isolate a single element from a scraped web page in R

北海茫月 2020-12-25 08:23

I want to use R to scrape this page (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html) and others, to get the goal scorers and t

1 Answer
  • 2020-12-25 08:43

    These questions are very helpful when dealing with web scraping and XML in R:

    1. Scraping html tables into R data frames using the XML package
    2. How to transform XML data into a data.frame?

    As for your particular example, I'm not sure what you want the output to look like, but this gets the "goals scored" text as a character vector:

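    library(XML)  # provides htmlParse() and xpathSApply()

    theURL <- "http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
    fifa.doc <- htmlParse(theURL)
    # Text content of every div whose class attribute is exactly 'cont'
    fifa <- xpathSApply(fifa.doc, "//*/div[@class='cont']", xmlValue)
    # Keep only the entries that mention goals
    goals.scored <- grep("Goals scored", fifa, value = TRUE)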
    

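    The xpathSApply function applies a function (here xmlValue, which extracts a node's text) to every node that matches the XPath expression, and returns the results as a vector. Note that I'm looking for a div with class='cont'. Class values are frequently a good way to pick out parts of an HTML document, because pages tend to use the same class for elements that hold the same kind of content.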
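
    To see the selection step in isolation, here is a minimal sketch that runs the same XPath against a small inline document instead of the live page. The markup is a hypothetical stand-in (the real FIFA page is certainly more complex), and the class name 'side' is made up for contrast; only the goal text is taken from the output shown in this answer.

```r
library(XML)

# Hypothetical stand-in markup; not the structure of the real page.
html <- paste0(
  "<html><body>",
  "<div class='cont'>Goals scored Philipp LAHM (GER) 6', Paulo WANCHOPE (CRC) 12'</div>",
  "<div class='cont'>Some other section</div>",
  "<div class='side'>Goals scored (different class, so not selected)</div>",
  "</body></html>"
)

# asText = TRUE tells htmlParse() the argument is markup, not a file or URL
doc <- htmlParse(html, asText = TRUE)

# One string per <div> whose class attribute is exactly 'cont'
cont <- xpathSApply(doc, "//div[@class='cont']", xmlValue)

# Of those, keep the entries that mention goals
goals <- grep("Goals scored", cont, value = TRUE)
print(goals)
```

    Note that [@class='cont'] compares the whole attribute value, so a div with class='cont wide' would not match; in that case an XPath predicate such as contains(@class, 'cont') may be a better fit.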

    You can clean this up however you want:

    > gsub("Goals scored", "", strsplit(goals.scored, ", ")[[1]])
    [1] "Philipp LAHM (GER) 6'"    "Paulo WANCHOPE (CRC) 12'" "Miroslav KLOSE (GER) 17'" "Miroslav KLOSE (GER) 61'" "Paulo WANCHOPE (CRC) 73'"
    [6] "Torsten FRINGS (GER) 87'"
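
    If you would rather have a data frame than a character vector, something along these lines might work. It is only a sketch in base R: the three sample strings are copied from the output above, the column names (player, team, minute) and the helper field() are my own, and the pattern assumes every entry looks like "NAME (TEAM) MINUTE'". Entries that do not fit that layout would come out as NA.

```r
# Sample strings copied from the output above
goals <- c("Philipp LAHM (GER) 6'",
           "Paulo WANCHOPE (CRC) 12'",
           "Miroslav KLOSE (GER) 17'")

# Assumed layout: "<player> (<three-letter team>) <minute>'"
pattern <- "^(.*) \\(([A-Z]{3})\\) ([0-9]+)'$"

# For each string: the full match followed by the three captured groups
parts <- regmatches(goals, regexec(pattern, goals))

# Hypothetical helper: pull the i-th element out of every match
field <- function(i) {
  vapply(parts, function(p) p[i], character(1), USE.NAMES = FALSE)
}

goals.df <- data.frame(
  player = field(2),
  team   = field(3),
  minute = as.integer(field(4)),
  stringsAsFactors = FALSE
)
print(goals.df)
```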
    