Question
I'm fairly new to R and have been working on a project (just for fun) to help me learn, and I'm running into something I can't find answers for online. I'm trying to teach myself to scrape websites for data, and I've started with the code below, which retrieves some data from 247Sports.
library(rvest)
library(stringr)
link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"
link.scrap <- read_html(link)
data <-
  html_nodes(x = link.scrap,
             css = '#page-content > div.main-div.clearfix > section.list-page > section > section > ul.content-list.ri-list > li:nth-child(3)') %>%
  html_text(trim = TRUE) %>%
  trimws()
When I view the data, it appears to be a vector of length 1, with multiple list items stored as one value. The problem I'm running into is separating these out into their respective columns. For example, I think the code below should split the data at ")" and then remove the whitespace from both of the resulting values, but I get a weird result.
f<-strsplit(data,")")
str_trim(f)
[1] "c(\"Ray Lima El Camino College (Torrance, CA\", \" DT 6-3 310 0.8681 39 4 9 Enrolled 1/9/2017\")"
I have tried a few other things, but with no success. So my question is: what would be the best way to take data from this HTML list and get it into a format where every data point has its own column (i.e. name, college, position, stats, etc.)?
Answer 1:
I've modified a couple of things in your code:
I used more generic CSS selectors, so the code extracts every row rather than a single list item.
I collected the individual columns as vectors and then built a data frame from them.
Please check:
library(rvest)
library(stringr)
library(tidyr)
link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"
link.scrap <- read_html(link)
names <- link.scrap %>% html_nodes('div.name') %>% html_text()
pos <- link.scrap %>% html_nodes('ul.metrics-list') %>% html_text()
status <- link.scrap %>% html_nodes('div.right-content.right') %>% html_text()
data <- data.frame(names, pos, status, stringsAsFactors = FALSE)
# Drop the first row (it appears to hold the list's header labels, not a player)
data <- data[-1, ]
head(data)
   names                                              pos          status
2  Kamilo Tongamoa Merced College (Merced, CA)        DT 6-5 320   Enrolled 8/24/2017
3  Ray Lima El Camino College (Torrance, CA)          DT 6-3 310   Enrolled 1/9/2017
4  O'Rien Vance George Washington (Cedar Rapids, IA)  OLB 6-3 235  Enrolled 6/12/2017
5  Matt Leo Arizona Western College (Yuma, AZ)        WDE 6-7 265  Enrolled 2/22/2017
6  Keontae Jones Colerain (Cincinnati, OH)            S 6-1 175    Enrolled 6/12/2017
7  Cordarrius Bailey Clarksdale (Clarksdale, MS)      WDE 6-4 210  Enrolled 6/12/2017
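On the odd str_trim() output in the question: strsplit() returns a list (one character vector per input string), and str_trim() coerces that whole list to a single string, which is why the result looks like a deparsed c(...) call. Below is a minimal sketch of one way around it, using the string reconstructed from that output and base trimws() in place of str_trim(). The fixed = TRUE argument is an added safeguard (not in the original code) so that ")" is treated as a literal character.

```r
# String reconstructed from the output shown in the question
# (the real scraped value may contain different whitespace).
data <- "Ray Lima El Camino College (Torrance, CA) DT 6-3 310 0.8681 39 4 9 Enrolled 1/9/2017"

# strsplit() returns a list with one character vector per input string.
f <- strsplit(data, ")", fixed = TRUE)
class(f)

# Take the first (and only) list element before trimming,
# so the trim function receives a plain character vector.
parts <- trimws(f[[1]])
parts
```

Splitting on ")" still leaves the position, height and weight in one string, so this only explains the result; it is not a full solution.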
Answer 2:
The basic problem is that the web page contains what looks like a table, but it is really a list with lots of styling. That means you need to work through each element, pull out the relevant nodes and further process the node content as required.
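To illustrate the per-element idea, here is a minimal sketch on a made-up HTML fragment. The class names mirror the selectors used in this answer, but the fragment itself, the player names and the `players` data frame are hypothetical. It uses html_node() (singular), which returns one result per list item, so an item that lacks a node produces NA instead of a shorter column.

```r
library(rvest)

# Made-up fragment; the real page's markup may differ.
page <- minimal_html('
  <ul class="content-list ri-list">
    <li><a class="name">Player A</a><span class="meta">School A (City, ST)</span><span class="score">0.8742</span></li>
    <li><a class="name">Player B</a><span class="meta">School B (City, ST)</span></li>
  </ul>')

# One node per player.
items <- html_nodes(page, "ul.content-list.ri-list > li")

# html_node() keeps the result aligned with `items`:
# the second player has no score, so that cell becomes NA.
players <- data.frame(
  name    = items %>% html_node("a.name") %>% html_text(trim = TRUE),
  college = items %>% html_node("span.meta") %>% html_text(trim = TRUE),
  score   = items %>% html_node("span.score") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)
players
```

This matters when some items omit a field: extracting each column across the whole page with html_nodes() would then give vectors of different lengths.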
First, grab the whole list:
library(dplyr)
library(rvest)
iowa_state <- read_html("https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank") %>%
  html_nodes('ul.content-list.ri-list')
Extract the metrics (position, height, weight). This creates a vector where the first 3 elements are the headers (Pos, Ht, Wt), then metrics for each player fill the other elements three at a time.
metrics <- iowa_state %>%
  html_nodes("ul.metrics-list li") %>%
  html_text() %>%
  trimws()
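As a small illustration of that layout, the sketch below uses a hand-typed sample shaped like the `metrics` vector (three headers, then three values per player) and reshapes it with base R. The sample values are taken from the first three players shown elsewhere in this thread. The names `m` and `metrics_df` are hypothetical.

```r
# Hand-typed sample with the same layout as the scraped vector.
metrics <- c("Pos", "Ht", "Wt",
             "DT",  "6-5", "320",
             "DT",  "6-3", "310",
             "OLB", "6-3", "235")

# Drop the three header entries, then fill a 3-column matrix row by row.
m <- matrix(metrics[-(1:3)], ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("pos", "ht", "wt")))
metrics_df <- as.data.frame(m, stringsAsFactors = FALSE)
metrics_df

# Equivalent to picking every third element, starting after the headers.
pos <- metrics[seq(4, length(metrics) - 2, 3)]
```

Either form works; the matrix version avoids writing one seq() call per column.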
Extract the status ("Enrolled" and the date). This creates a vector where "Enrolled" fills elements 1, 3, 5... and the date fills elements 2, 4, 6...
status <- iowa_state %>%
  html_nodes("p.commit-date") %>%
  html_text() %>%
  trimws()
Now we can build up a data frame (or tibble) column by column:
iowa_state_df <- tibble(
  name     = iowa_state %>% html_nodes("a.name") %>% html_text(),
  college  = iowa_state %>% html_nodes("span.meta") %>% html_text() %>% trimws(),
  pos      = metrics[seq(4, length(metrics)-2, 3)],
  ht       = metrics[seq(5, length(metrics)-1, 3)],
  wt       = metrics[seq(6, length(metrics), 3)],
  score    = iowa_state %>% html_nodes("span.score") %>% html_text(),
  natrank  = iowa_state %>% html_nodes("div.rank a.natrank") %>% html_text(),
  posrank  = iowa_state %>% html_nodes("div.rank a.posrank") %>% html_text(),
  sttrank  = iowa_state %>% html_nodes("div.rank a.sttrank") %>% html_text(),
  enrolled = status[seq(1, length(status)-1, 2)],
  date     = status[seq(2, length(status), 2)]
)
glimpse(iowa_state_df)
Observations: 26
Variables: 11
$ name <chr> "Kamilo Tongamoa", "Ray Lima", "O'Rien Vance", "Matt Leo", "Keontae Jones", "Cordarriu...
$ college <chr> "Merced College (Merced, CA)", "El Camino College (Torrance, CA)", "George Washington ...
$ pos <chr> "DT", "DT", "OLB", "WDE", "S", "WDE", "WR", "CB", "CB", "DUAL", "SDE", "OT", "OT", "WR...
$ ht <chr> "6-5", "6-3", "6-3", "6-7", "6-1", "6-4", "5-11", "6-1", "6-0.5", "6-4", "6-3", "6-5",...
$ wt <chr> "320", "310", "235", "265", "175", "210", "170", "190", "170", "221", "250", "260", "3...
$ score <chr> "0.8742", "0.8681", "0.8681", "0.8656", "0.8624", "0.8546", "0.8515", "0.8482", "0.847...
$ natrank <chr> "28", "39", "508", "48", "587", "724", "806", "885", "924", "928", "929", "NA", "NA", ...
$ posrank <chr> "3", "4", "29", "5", "42", "42", "117", "91", "100", "19", "42", "88", "90", "12", "57...
$ sttrank <chr> "5", "9", "4", "7", "25", "13", "9", "124", "20", "8", "6", "10", "24", "37", "20", "1...
$ enrolled <chr> "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "E...
$ date <chr> "8/24/2017", "1/9/2017", "6/12/2017", "2/22/2017", "6/12/2017", "6/12/2017", "6/12/201...
You could then format the type of the columns (date, numeric etc.) as required.
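A minimal sketch of that conversion step in base R follows. It uses a small hand-typed data frame rather than the scraped one; the rows are illustrative combinations of values seen in the glimpse() output, not actual player records. It also assumes the dates are month/day/year and that heights are written as feet-inches. The column `ht_in` is a hypothetical addition.

```r
# Illustrative sample only; every column starts out as character,
# as in the scraped data frame.
df <- data.frame(
  ht      = c("6-5", "6-3", "6-0.5"),
  wt      = c("320", "310", "170"),
  score   = c("0.8742", "0.8681", "0.8482"),
  natrank = c("28", "39", "NA"),
  date    = c("8/24/2017", "1/9/2017", "6/12/2017"),
  stringsAsFactors = FALSE
)

df$wt      <- as.integer(df$wt)
df$score   <- as.numeric(df$score)
df$natrank <- as.integer(df$natrank)                   # the string "NA" becomes a missing value
df$date    <- as.Date(df$date, format = "%m/%d/%Y")    # assumes month/day/year

# Assumes "feet-inches"; converts height to total inches.
ht_parts <- strsplit(df$ht, "-", fixed = TRUE)
df$ht_in <- sapply(ht_parts, function(p) as.numeric(p[1]) * 12 + as.numeric(p[2]))

str(df)
```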
Source: https://stackoverflow.com/questions/48374625/cleaning-data-scraped-from-web