Cleaning Data Scraped from Web

Question

I'm fairly new to R and have been working on a project (just for fun) to help me learn, and I've run into something I can't find answers for online. I'm trying to teach myself to scrape websites for data, and I've started with the code below, which retrieves some data from 247Sports.

library(rvest)
library(stringr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

link.scrap <- read_html(link)
data <- 
  html_nodes(x   = link.scrap, 
             css = '#page-content > div.main-div.clearfix > section.list-page > section > section > ul.content-list.ri-list > li:nth-child(3)') %>%
  html_text(trim = TRUE) %>% 
  trimws()

When I view the data, it appears to be a vector of length 1, with multiple list items stored as one value. The problem I'm running into is separating these out into their respective columns. For example, when I run the code below, which I think should split the data at ")" and then remove the whitespace from both resulting values, I get a weird result.

f<-strsplit(data,")")
str_trim(f)
[1] "c(\"Ray Lima  El Camino College (Torrance, CA\", \"         DT 6-3 310    0.8681      39 4 9       Enrolled   1/9/2017\")"

I have tried a few other things, but with no success. So my question is: what is the best way to take data from this HTML list and get it into a format where every data point has its own column (i.e. name, college, position, stats, etc.)?

Answer 1:

I've modified a couple of things in your code:

Used more generic CSS selectors, so the data is extracted for every row rather than for a single list item.
Collected the individual columns as vectors and then built a data frame from them.

Please check:

library(rvest)
library(stringr)
library(tidyr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

link.scrap <- read_html(link)

names <- link.scrap %>% html_nodes('div.name') %>% html_text()

pos <- link.scrap %>% html_nodes('ul.metrics-list') %>% html_text() 

status <- link.scrap %>% html_nodes('div.right-content.right') %>% html_text() 

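# Combine the column vectors into a data frame, keeping strings as character
data <- data.frame(names, pos, status, stringsAsFactors = FALSE)

# Drop the first row, which is the list's header row rather than a player
data <- data[-1, ]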

head(data)


> head(data)
                                                      names          pos                     status
2        Kamilo Tongamoa  Merced College (Merced, CA)        DT 6-5 320     Enrolled   8/24/2017   
3        Ray Lima  El Camino College (Torrance, CA)          DT 6-3 310      Enrolled   1/9/2017   
4  O'Rien Vance  George Washington (Cedar Rapids, IA)       OLB 6-3 235     Enrolled   6/12/2017   
5          Matt Leo  Arizona Western College (Yuma, AZ)     WDE 6-7 265     Enrolled   2/22/2017   
6            Keontae Jones  Colerain (Cincinnati, OH)         S 6-1 175     Enrolled   6/12/2017   
7      Cordarrius Bailey  Clarksdale (Clarksdale, MS)       WDE 6-4 210     Enrolled   6/12/2017   
>
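
The names and pos columns above still hold several values each. As a rough sketch of how they could be split further, the example below uses base R on two sample values copied from the printed output. It assumes the name and school are separated by two or more spaces, as they appear in that output; the whitespace on the live page may differ. The data frame name players is just for illustration. Note that strsplit() returns a list, so passing its result straight to str_trim() coerces the whole list to one string, which appears to be why the question's output shows a literal c(...).

```r
# Sample values copied from the head(data) output above.
data <- data.frame(
  names = c("Ray Lima  El Camino College (Torrance, CA)",
            "O'Rien Vance  George Washington (Cedar Rapids, IA)"),
  pos   = c("  DT 6-3 310  ", " OLB 6-3 235 "),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per input string.
# Assumption: name and school are separated by two or more whitespace characters.
name_parts <- strsplit(trimws(data$names), "\\s{2,}")
pos_parts  <- strsplit(trimws(data$pos), "\\s+")

# Bind the pieces row-wise into character matrices, one row per player.
name_mat <- do.call(rbind, name_parts)
pos_mat  <- do.call(rbind, pos_parts)

players <- data.frame(
  name    = name_mat[, 1],
  college = name_mat[, 2],
  pos     = pos_mat[, 1],
  ht      = pos_mat[, 2],
  wt      = as.integer(pos_mat[, 3]),
  stringsAsFactors = FALSE
)

print(players)
```

The tidyr package loaded above also provides separate(), which could do the same job inside a pipeline.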

Answer 2:

The basic problem is that the web page contains what looks like a table, but it is really a list with lots of styling. That means you need to work through each element, pull out the relevant nodes and further process the node content as required.
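
As a small illustration of working through each list element, here is a sketch on inline HTML. The markup is hypothetical and only loosely modeled on the selectors used in this answer. It is a variation on the column-by-column approach that follows, not the same method: calling html_node() (singular) on a set of list items returns one result per item, so a player with a missing field gets NA instead of shifting the rest of the column.

```r
library(rvest)

# Hypothetical markup: the second player has no score element.
html <- '
<ul class="content-list">
  <li><a class="name">Player A</a><span class="meta">School A (City, ST)</span><span class="score">0.8742</span></li>
  <li><a class="name">Player B</a><span class="meta">School B (City, ST)</span></li>
</ul>'

# One node per list item.
items <- read_html(html) %>% html_nodes("ul.content-list > li")

# html_node() yields exactly one result per item; missing nodes become NA text.
players <- data.frame(
  name    = items %>% html_node("a.name") %>% html_text(),
  college = items %>% html_node("span.meta") %>% html_text(),
  score   = items %>% html_node("span.score") %>% html_text(),
  stringsAsFactors = FALSE
)

print(players)
```

With html_nodes() (plural) on the whole list, the score column here would have only one element and the data frame could not be built.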

First, grab the whole list:

library(dplyr)
library(rvest)

iowa_state <- read_html("https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank") %>%
  html_nodes('ul.content-list.ri-list')

Extract the metrics (position, height, weight). This creates a vector where the first 3 elements are the headers (Pos, Ht, Wt), then metrics for each player fill the other elements three at a time.

metrics <- iowa_state %>% 
  html_nodes("ul.metrics-list li") %>% 
  html_text() %>% 
  trimws()
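
To make that layout concrete, here is a minimal sketch on a hand-made vector with the same shape (three header labels followed by three values per player; the values are copied from the output shown further down). It reshapes the flat vector with matrix(), which is an alternative to the seq() indexing used below, and checks that both give the same position column.

```r
# Hand-made stand-in for the scraped metrics vector.
metrics <- c("Pos", "Ht", "Wt",
             "DT",  "6-5", "320",
             "DT",  "6-3", "310",
             "OLB", "6-3", "235")

# Drop the three header labels and fill a 3-column matrix row by row,
# reusing the labels as column names.
metrics_mat <- matrix(metrics[-(1:3)], ncol = 3, byrow = TRUE,
                      dimnames = list(NULL, metrics[1:3]))
metrics_df <- as.data.frame(metrics_mat, stringsAsFactors = FALSE)

# Same result as picking every third element starting at position 4.
pos_by_seq <- metrics[seq(4, length(metrics) - 2, 3)]
same_pos <- identical(metrics_df$Pos, pos_by_seq)

print(metrics_df)
print(same_pos)
```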

Extract the status ("Enrolled" and the date). This creates a vector where "Enrolled" fills elements 1, 3, 5, ... and the date fills elements 2, 4, 6, ...

status <- iowa_state %>% 
  html_nodes("p.commit-date") %>% 
  html_text() %>% 
  trimws()

Now we can build up a data frame (or tibble) column by column:

iowa_state_df <- tibble(name     = iowa_state %>% html_nodes("a.name") %>% html_text(),
                        college  = iowa_state %>% html_nodes("span.meta") %>% html_text() %>% trimws(),
                        pos      = metrics[seq(4, length(metrics)-2, 3)],
                        ht       = metrics[seq(5, length(metrics)-1, 3)],
                        wt       = metrics[seq(6, length(metrics), 3)],
                        score    = iowa_state %>% html_nodes("span.score") %>% html_text(),
                        natrank  = iowa_state %>% html_nodes("div.rank a.natrank") %>% html_text(),
                        posrank  = iowa_state %>% html_nodes("div.rank a.posrank") %>% html_text(),
                        sttrank  = iowa_state %>% html_nodes("div.rank a.sttrank") %>% html_text(),
                        enrolled = status[seq(1, length(status)-1, 2)],
                        date     = status[seq(2, length(status), 2)]
)

glimpse(iowa_state_df)

Observations: 26
Variables: 11
$ name     <chr> "Kamilo Tongamoa", "Ray Lima", "O'Rien Vance", "Matt Leo", "Keontae Jones", "Cordarriu...
$ college  <chr> "Merced College (Merced, CA)", "El Camino College (Torrance, CA)", "George Washington ...
$ pos      <chr> "DT", "DT", "OLB", "WDE", "S", "WDE", "WR", "CB", "CB", "DUAL", "SDE", "OT", "OT", "WR...
$ ht       <chr> "6-5", "6-3", "6-3", "6-7", "6-1", "6-4", "5-11", "6-1", "6-0.5", "6-4", "6-3", "6-5",...
$ wt       <chr> "320", "310", "235", "265", "175", "210", "170", "190", "170", "221", "250", "260", "3...
$ score    <chr> "0.8742", "0.8681", "0.8681", "0.8656", "0.8624", "0.8546", "0.8515", "0.8482", "0.847...
$ natrank  <chr> "28", "39", "508", "48", "587", "724", "806", "885", "924", "928", "929", "NA", "NA", ...
$ posrank  <chr> "3", "4", "29", "5", "42", "42", "117", "91", "100", "19", "42", "88", "90", "12", "57...
$ sttrank  <chr> "5", "9", "4", "7", "25", "13", "9", "124", "20", "8", "6", "10", "24", "37", "20", "1...
$ enrolled <chr> "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "E...
$ date     <chr> "8/24/2017", "1/9/2017", "6/12/2017", "2/22/2017", "6/12/2017", "6/12/2017", "6/12/201...

You could then convert the columns to appropriate types (dates, numerics, etc.) as required.
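
A minimal sketch of that conversion step, using base R on three sample rows copied from the glimpse() output. The data frame name players, the helper height_to_inches() and the ht_in column are hypothetical names for illustration, and storing height in inches is just one possible choice.

```r
# Sample rows copied from the glimpse() output; every column starts as character.
players <- data.frame(
  name    = c("Kamilo Tongamoa", "Ray Lima", "O'Rien Vance"),
  ht      = c("6-5", "6-3", "6-3"),
  wt      = c("320", "310", "235"),
  score   = c("0.8742", "0.8681", "0.8681"),
  natrank = c("28", "39", "508"),
  date    = c("8/24/2017", "1/9/2017", "6/12/2017"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "feet-inches" strings such as "6-5" into inches.
height_to_inches <- function(ht) {
  parts <- strsplit(ht, "-", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 12 + as.numeric(p[2]), numeric(1))
}

players$ht_in   <- height_to_inches(players$ht)
players$wt      <- as.integer(players$wt)
players$score   <- as.numeric(players$score)
players$natrank <- as.integer(players$natrank)
# The dates are written month/day/year.
players$date    <- as.Date(players$date, format = "%m/%d/%Y")

str(players)
```

Rank columns that contain the string "NA", as natrank does in the full data, become missing values when converted this way.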

Source: https://stackoverflow.com/questions/48374625/cleaning-data-scraped-from-web

Tags

web-scraping

rvest