Loop to scrape data from Wikipedia in R

死守一世寂寞 2021-01-21 04:43

I am trying to extract data about celebrity/notable deaths for analysis. Wikipedia has a very regular structure to their html paths concerning notable dates of death. It looks l

2 Answers
  • 2021-01-21 05:15

    html_text(fnames) returns a character vector. Your problem is that you are trying to append a vector onto a data frame.
    Try converting your variable text to a data frame before appending:

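    library(rvest)

    # Assumed setup from the question: month names and an empty data frame to fill
    mlist = month.name
    data = data.frame()

    for (y in 2015:2015){
      for (m in 1:12){
        # paste0() joins the pieces without spaces, so the URL stays valid
        site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                   "_",y))
        fnames = html_nodes(site,"#mw-content-text h3+ ul li")
        text = html_text(fnames)

        # wrap the character vector in a data frame so rbind() can append it
        temp<-data.frame(text, stringsAsFactors = FALSE)

        data = rbind(data,temp)
      }
    }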
    

    This is not the best technique for performance reasons: each time through the loop, the data frame is copied and its memory reallocated, which slows things down. Since this is a one-time job with a limited number of requests, it should be manageable in this case.
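
    If the reallocation ever becomes a concern, one common alternative is to store each month's data frame in a preallocated list and bind them once after the loop. The sketch below is only an illustration: `scraped` is a hypothetical stand-in for the text returned by html_text(), and the names `pieces` and `deaths` are assumptions, not part of the original code.

    ```r
    # Hypothetical stand-in for the scraped text of three months; in the real
    # loop each element would come from html_text(html_nodes(site, ...)).
    scraped <- list(
      January  = c("Person A, 90, actor", "Person B, 75, writer"),
      February = c("Person C, 81, singer"),
      March    = character(0)
    )

    # Build one small data frame per month and store it in a preallocated list.
    pieces <- vector("list", length(scraped))
    for (i in seq_along(scraped)) {
      pieces[[i]] <- data.frame(
        month = rep(names(scraped)[i], length(scraped[[i]])),
        text  = scraped[[i]],
        stringsAsFactors = FALSE
      )
    }

    # Bind everything together in a single step after the loop.
    deaths <- do.call(rbind, pieces)
    ```

    The list grows by reference to its elements, so the full data frame is built only once at the end.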

  • 2021-01-21 05:28

    I wasn't able to get the same error that you got, but I think I know what you want to do.

    I have a feeling this has something to do with the unequal number of deaths in each month.

    I'd suggest doing it this way:

    library(rvest)

    mlist = c("January","February","March","April","May","June","July","August",
          "September","October","November","December")
    
    for (y in 2015:2015){
      for (m in 1:12){
        # paste0() joins the pieces without spaces, so the URL stays valid
        site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                           "_",y))
        fnames = html_nodes(site,"#mw-content-text h3+ ul li")
        text = html_text(fnames)
        # create a variable named after the month, e.g. January
        assign(mlist[m],text)
      }
    }
    

    This creates a character vector for each month's deaths, stored in a variable named after the month.
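
    Variables created with assign() can be read back by name with get() or mget(). The following is a minimal sketch with made-up entries standing in for the scraped text; `months_found`, `by_month`, `january_deaths`, and `counts` are hypothetical names used only for illustration.

    ```r
    # Hypothetical stand-ins for two of the vectors created by assign() in the loop.
    assign("January",  c("Person A, 90, actor", "Person B, 75, writer"))
    assign("February", c("Person C, 81, singer"))

    # mget() looks up several variables by name and returns them as a named list.
    months_found <- c("January", "February")
    by_month <- mget(months_found)

    # get() fetches a single variable by its name.
    january_deaths <- get("January")

    # Number of entries per month.
    counts <- lengths(by_month)
    ```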

    An alternative (for easier use later in a loop to join them) is to use a list:

    library(rvest)

    # one list slot per month
    data = vector("list",12)
    mlist = c("January","February","March","April","May","June","July","August",
          "September","October","November","December")
    
    for (y in 2015:2015){
      for (m in 1:12){
        # paste0() joins the pieces without spaces, so the URL stays valid
        site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                           "_",y))
        fnames = html_nodes(site,"#mw-content-text h3+ ul li")
        text = html_text(fnames)
        # store this month's character vector in slot m
        data[[m]] = text
      }
    }
    

    Personally, I don't like dealing with lists in R, but this seems to be the best workaround.
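
    Once the list is filled, joining the months into a single data frame takes only a couple of calls. This is a sketch under assumptions: the list here is filled with made-up entries instead of scraped text, and `all_deaths` is a hypothetical name.

    ```r
    mlist <- month.name

    # Hypothetical stand-in for the list filled by the loop; only the first two
    # months have entries here, the rest are empty character vectors.
    data <- rep(list(character(0)), 12)
    data[[1]] <- c("Person A, 90, actor", "Person B, 75, writer")
    data[[2]] <- c("Person C, 81, singer")
    names(data) <- mlist

    # Flatten the list into one data frame, repeating each month name once per
    # entry so every row keeps track of where it came from.
    all_deaths <- data.frame(
      month = rep(names(data), times = lengths(data)),
      text  = unlist(data, use.names = FALSE),
      stringsAsFactors = FALSE
    )
    ```

    Months with no entries simply contribute no rows, so the unequal number of deaths per month is not a problem.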
