Using parallelisation to scrape web pages with R


I am trying to scrape a large number of web pages in order to analyse them later. Since the number of URLs is huge, I decided to use the parallel package along with …

1 Answer

    You can use getURIAsynchronous from the RCurl package, which lets the caller specify multiple URIs to download at the same time.

    library(RCurl)
    library(XML)

    get.asynch <- function(urls){
      ## download all pages concurrently
      txt <- getURIAsynchronous(urls)
      ## this part can be easily parallelised (see the sketch after these
      ## two functions); I am just using lapply here as a first attempt
      res <- lapply(txt, function(x){
        doc <- htmlParse(x, asText = TRUE)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      })
      res
    }
    
    get.synch <- function(urls){
      lapply(urls, function(x){
        ## htmlParse() downloads and parses one URL at a time
        doc <- htmlParse(x)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      })
    }
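
    The comment in get.asynch hints at handing the parsing step to the parallel package mentioned in the question. Below is a minimal sketch of that idea, assuming a Unix-alike system where parallel::mclapply can fork (on Windows you would build a cluster and use parLapply instead); the function name get.asynch.par and the n.cores argument are purely illustrative:

    library(RCurl)
    library(XML)
    library(parallel)

    get.asynch.par <- function(urls, n.cores = 2){
      ## download all pages concurrently, as before
      txt <- getURIAsynchronous(urls)
      ## parse the downloaded HTML strings on n.cores forked workers
      ## (illustrative only; mc.cores > 1 is not supported on Windows)
      mclapply(txt, function(x){
        doc <- htmlParse(x, asText = TRUE)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      }, mc.cores = n.cores)
    }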
    

    Here is some benchmarking for 100 URLs: the asynchronous version roughly halves the total download-and-parse time.

    library(microbenchmark)
    uris <- "http://www.omegahat.org/RCurl/index.html"
    urls <- replicate(100, uris)
    microbenchmark(get.asynch(urls), get.synch(urls), times = 1)
    
    Unit: seconds
                 expr      min       lq   median       uq      max neval
     get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
      get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
    
