R: Webscraping a list of URLs to get a DataFrame

青春壹個敷衍的年華 submitted on 2019-12-04 19:25:45

purrr::map_df is a variant of lapply that row-binds the results into a single data frame, which lets you write:

library(tidyverse)
library(rvest)

urls <- list("http://simple.ripley.com.pe/tv-y-video/televisores/ver-todo-tv",
        "http://simple.ripley.com.pe/tv-y-video/televisores/ver-todo-tv?page=2&orderBy=seq",
        "http://simple.ripley.com.pe/tv-y-video/televisores/ver-todo-tv?page=3&orderBy=seq",
        "http://simple.ripley.com.pe/tv-y-video/televisores/ver-todo-tv?page=4&orderBy=seq")

h <- urls %>% map(read_html)    # scrape once, parse as necessary

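h %>% map_df(~{
    # pull the price strings out of each parsed page
    r.precio.antes <- html_nodes(.x, 'span.catalog-product-list-price') %>% html_text()
    r.precio.actual <- html_nodes(.x, 'span.catalog-product-offer-price') %>% html_text()

    tibble(    # data_frame() is deprecated in favour of tibble()
        periodo = lubridate::year(Sys.Date()),
        fecha = Sys.Date(),
        ecommerce = "ripley",
        producto = html_nodes(.x, "span.catalog-product-name") %>% html_text(),
        # if/else keeps the whole vector; ifelse() with a length-1 test
        # would return only the first price and recycle it over every row.
        # Assumes a page has either no prices of a kind or one per product.
        precio.antes = if (length(r.precio.antes) == 0) NA_character_ else r.precio.antes,
        precio.actual = if (length(r.precio.actual) == 0) NA_character_ else r.precio.actual
    )
})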
#> # A tibble: 85 x 6
#>    periodo      fecha ecommerce                                 producto
#>      <dbl>     <date>     <chr>                                    <chr>
#>  1    2017 2017-05-16    ripley           LG SMART TV 43'' UHD 43UH6030 
#>  2    2017 2017-05-16    ripley       SAMSUNG SMART TV UHD 40" 40KU6000 
#>  3    2017 2017-05-16    ripley       SAMSUNG SMART TV UHD 50" 50KU6000 
#>  4    2017 2017-05-16    ripley SAMSUNG SMART TV UHD 49" CURVO 49KU6300 
#>  5    2017 2017-05-16    ripley SAMSUNG SMART TV UHD 55" CURVO 55KU6300 
#>  6    2017 2017-05-16    ripley SAMSUNG SMART TV UHD 55" CURVO 55KU6500 
#>  7    2017 2017-05-16    ripley SAMSUNG SMART TV UHD 65" CURVO 65KU6500 
#>  8    2017 2017-05-16    ripley           LG SMART TV UHD 49'' 49UH6500 
#>  9    2017 2017-05-16    ripley           LG SMART TV 55'' UHD 55UH6030 
#> 10    2017 2017-05-16    ripley       LG SMART TV OLED 4K 55" OLED55B6P 
#> # ... with 75 more rows, and 2 more variables: precio.antes <chr>,
#> #   precio.actual <chr>
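
If only some products on a page lack one of the prices, extracting each field page-wide can leave the columns with different lengths. A minimal sketch of one way around that is to select each product's container first and then use html_node (singular), which returns a missing node, and hence NA, when a product has no match. The markup and the class names below (div.product, span.name, span.list, span.offer) are made up for illustration and are not the real site's selectors:

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical markup standing in for two scraped pages;
# "TV B" deliberately has no list price.
pages <- list(
  '<div class="product"><span class="name">TV A</span>
     <span class="list">S/ 1999</span><span class="offer">S/ 1499</span></div>
   <div class="product"><span class="name">TV B</span>
     <span class="offer">S/ 999</span></div>',
  '<div class="product"><span class="name">TV C</span>
     <span class="list">S/ 2999</span><span class="offer">S/ 2499</span></div>'
) %>% map(read_html)

parse_page <- function(page) {
  # one node per product, so every column has one entry per product
  items <- html_nodes(page, "div.product")
  tibble(
    producto      = items %>% html_node("span.name")  %>% html_text(),
    precio.antes  = items %>% html_node("span.list")  %>% html_text(),
    precio.actual = items %>% html_node("span.offer") %>% html_text()
  )
}

precios <- map_df(pages, parse_page)
print(precios)
```

Here the row for "TV B" gets NA in precio.antes instead of shifting the other prices up or failing with a length mismatch.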

Alternatively, to fix the list of matrices in base R, where x is the list of matrices (one per page) produced by the original code:

df <- as.data.frame(do.call(rbind, lapply(x, t)), stringsAsFactors = FALSE)
# or df <- as.data.frame(t(do.call(cbind, x)), stringsAsFactors = FALSE) 
df[] <- lapply(df, type.convert, as.is = TRUE)
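
To see what the base R version does, here is a small self-contained sketch with a toy x. The shape of x (one character matrix per page, fields in rows and products in columns) and the field names are assumptions made for illustration:

```r
# Toy stand-in for x: fields are rows, products are columns.
x <- list(
  rbind(producto = c("TV A", "TV B"), precio = c("1999", "999")),
  rbind(producto = "TV C", precio = "2999")
)

# Transpose each matrix so products become rows, then stack the pages.
df <- as.data.frame(do.call(rbind, lapply(x, t)), stringsAsFactors = FALSE)

# Every column is character at this point; let R guess better types,
# keeping strings as strings (as.is = TRUE) rather than factors.
df[] <- lapply(df, type.convert, as.is = TRUE)

str(df)
```

Assigning into df[] rather than df keeps the result a data frame instead of replacing it with a plain list.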