Loop to scrape data from Wikipedia in R

死守一世寂寞 2021-01-21 04:43

I am trying to extract data about celebrity/notable deaths for analysis. Wikipedia has a very regular structure to their html paths concerning notable dates of death. It looks l

2 answers

春和景丽 (original poster)

2021-01-21 05:28

I wasn't able to get the same error that you got, but I think I know what you want to do.

I have a feeling this has something to do with the unequal number of deaths in each month.

I'd suggest doing it this way:

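library(rvest)  # provides read_html(), html_nodes(), html_text()

mlist = c("January","February","March","April","May","June","July","August",
      "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() would insert into the URL
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                            "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    # store this month's entries in a variable named after the month
    assign(mlist[m],text)
  }
}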

This creates one character vector per month, stored in a variable named after that month (January, February, and so on).
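
To show how those per-month variables can be read back by name, here is a small sketch. The entries are placeholder strings standing in for scraped text, not real data, and only two months are used to keep it short; `get()`, `mget()` and `lengths()` are base R.

```r
# Placeholder values standing in for the scraped text (not real data)
mlist = c("January", "February")
assign("January", c("entry 1", "entry 2", "entry 3"))
assign("February", c("entry 1", "entry 2"))

# Look up one month's vector by its name
jan = get("January")
length(jan)

# Collect all the month variables into a single named list
by_month = mget(mlist)
lengths(by_month)   # number of entries per month
```

Having to go through `get()` or `mget()` is the main cost of the `assign()` approach: the month names live in the workspace rather than in one object.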

An alternative (for easier use later in a loop to join them) is to use a list:

library(rvest)  # provides read_html(), html_nodes(), html_text()

data = vector("list",12)  # one slot per month
mlist = c("January","February","March","April","May","June","July","August",
      "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() would insert into the URL
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                            "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    data[[m]] = text
  }
}

Personally, I don't like dealing with lists in R, but this seems to be the best workaround.
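
For joining the months later, one possible approach is to name the list elements and stack them into a single data frame with a month column. This is an assumption about the next step, not part of the original code, and the entries below are placeholders rather than scraped data; the names `deaths`, `month` and `entry` are made up for illustration.

```r
# Placeholder list in the same shape as `data` above (not real scraped text)
mlist = c("January", "February", "March")
data = list(c("entry 1", "entry 2"),
            c("entry 1"),
            c("entry 1", "entry 2", "entry 3"))
names(data) = mlist

# One row per entry; rep() repeats each month name once per entry in that month,
# so months with different numbers of entries are handled without padding
deaths = data.frame(
  month = rep(names(data), times = lengths(data)),
  entry = unlist(data, use.names = FALSE),
  stringsAsFactors = FALSE
)

nrow(deaths)          # total number of entries across all months
table(deaths$month)   # entries per month
```

Because each month contributes as many rows as it has entries, the unequal number of deaths per month is not a problem in this long format.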
