Web Scraping in R with loop from data.frame

Posted by 我怕爱的太早我们不能终老 on 2020-03-06 04:38:49

Question


library(rvest)

df <- data.frame(Links = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"))

for(i in 1:3) {
  webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
  data <- webpage %>%
    html_nodes(".specs") %>%
    .[[1]] %>% 
    html_table(fill = TRUE)
}

I want the loop to work for all 3 values in df$Links, but the code above only keeps the last one. The downloaded data must also be identifiable by model (perhaps through a new column holding the model name).


Answer 1:


The problem is in how you're structuring your for loop. It's much easier not to use one in the first place, though, as R has great support for iterating over lists with functions such as lapply and purrr::map. One version of how you could structure your data:

library(tidyverse)
library(rvest)

base_url <- "https://www.whatmobile.com.pk/"

# tibble() replaces the deprecated data_frame(); later columns can refer to earlier ones
models <- tibble(model = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"),
                 link = paste0(base_url, model),
                 page = map(link, read_html))

model_specs <- models %>% 
    mutate(node = map(page, html_node, '.specs'),
           specs = map(node, html_table, header = TRUE, fill = TRUE),
           specs = map(specs, set_names, c('var1', 'var2', 'val1', 'val2'))) %>% 
    select(model, specs) %>% 
    unnest(specs)  # tidyr >= 1.0 expects the list-column to be named explicitly

model_specs
#> # A tibble: 119 x 5
#>              model      var1       var2
#>              <chr>     <chr>      <chr>
#>  1 Qmobile_Noir-M6     Build         OS
#>  2 Qmobile_Noir-M6     Build Dimensions
#>  3 Qmobile_Noir-M6     Build     Weight
#>  4 Qmobile_Noir-M6     Build        SIM
#>  5 Qmobile_Noir-M6     Build     Colors
#>  6 Qmobile_Noir-M6 Frequency    2G Band
#>  7 Qmobile_Noir-M6 Frequency    3G Band
#>  8 Qmobile_Noir-M6 Frequency    4G Band
#>  9 Qmobile_Noir-M6 Processor        CPU
#> 10 Qmobile_Noir-M6 Processor    Chipset
#> # ... with 109 more rows, and 2 more variables: val1 <chr>, val2 <chr>

The data is still pretty messy, but at least it's all there.
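
For comparison, the same iteration can be sketched with base R's lapply, which the answer mentions but does not show. This is only a minimal sketch: scrape_specs is a hypothetical helper, and the inline HTML strings are made-up placeholders standing in for the real pages so the example runs offline. With live pages you would pass paste0(base_url, model) instead, since read_html accepts a URL as well as literal HTML.

```r
library(rvest)

# Placeholder pages (NOT real scraped content), keyed by model name.
# In practice each value would be a URL such as paste0(base_url, model).
pages <- list(
  "Qmobile_Noir-M6" = "<table class='specs'><tr><td>OS</td><td>placeholder</td></tr></table>",
  "Qmobile_Noir-A1" = "<table class='specs'><tr><td>OS</td><td>placeholder</td></tr></table>",
  "Qmobile_Noir-E8" = "<table class='specs'><tr><td>OS</td><td>placeholder</td></tr></table>"
)

# Hypothetical helper: parse one page and return its first .specs table.
scrape_specs <- function(src) {
  read_html(src) %>%
    html_node(".specs") %>%
    html_table()
}

# lapply returns one result per input and keeps the names,
# so nothing is overwritten between iterations.
spec_list <- lapply(pages, scrape_specs)

names(spec_list)
```

Because the result is a named list, each table stays associated with the model it came from.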




Answer 2:


The loop does capture all three values, but each iteration overwrites the previous result. That's why you only see one table, and it is the one for the last page.

You need to initialise a variable before you enter the loop. I suggest a list, so you can store the data from each iteration. Something like:

final_table <- list()

for(i in 1:3) {
  webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
  data <- webpage %>%
    html_nodes(".specs") %>%
    .[[1]] %>%
    html_table(fill = TRUE)

  # Keep this page's table in slot i so the next iteration does not overwrite it
  final_table[[i]] <- data.frame(data, stringsAsFactors = FALSE)
}

This way, each iteration stores its table in its own slot of the list instead of overwriting the previous one.
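
The question also asks for the tables to be identifiable by model. One possible way, sketched here in base R, is to name the list elements after the models and add that name as a column before stacking the tables. The final_table below is a stand-in with placeholder values rather than scraped data, and the column names spec and value are assumptions. Stacking with rbind only works if every table has the same columns.

```r
links <- c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8")

# Stand-in for the list built by the loop: one data frame per model.
# The values are placeholders, not real specifications.
final_table <- lapply(links, function(m) {
  data.frame(spec = c("OS", "Weight"),
             value = c("placeholder", "placeholder"),
             stringsAsFactors = FALSE)
})

# Name each element after its model.
names(final_table) <- links

# Prepend the model name as a column of each table.
labelled <- Map(
  function(tbl, model) data.frame(model = model, tbl, stringsAsFactors = FALSE),
  final_table, names(final_table)
)

# Stack everything into a single data frame and drop the generated row names.
all_specs <- do.call(rbind, labelled)
rownames(all_specs) <- NULL

all_specs
```

If dplyr is available, dplyr::bind_rows(final_table, .id = "model") does the naming and stacking in one step and tolerates tables with differing columns.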



Source: https://stackoverflow.com/questions/44910955/web-scraping-in-r-with-loop-from-data-frame
