Using parallelisation to scrape web pages with R


I am trying to scrape a large number of web pages so that I can analyse them later. Since the number of URLs is huge, I decided to use the parallel package along with …

1 Answer
  • 2021-01-02 17:06

    You can use getURIAsynchronous from the RCurl package, which allows the caller to specify multiple URIs to download at the same time.

    library(RCurl)
    library(XML)

    get.asynch <- function(urls){
      ## download all pages concurrently in a single call
      txt <- getURIAsynchronous(urls)
      ## this parsing step can easily be parallelised;
      ## plain lapply is used here as a first attempt
      res <- lapply(txt, function(x){
        doc <- htmlParse(x, asText = TRUE)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      })
      res
    }

    get.synch <- function(urls){
      ## download and parse the pages one by one
      lapply(urls, function(x){
        doc <- htmlParse(x)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      })
    }
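
    As the comment in get.asynch notes, the parsing step itself can also be parallelised. A minimal sketch using the parallel package (which the question mentions) might look like the following; the wrapper name get.asynch.par and the core count are illustrative assumptions, and mclapply relies on forking, so on Windows mc.cores must stay at 1.

    library(parallel)

    ## sketch only: asynchronous download, then parsing spread over several cores
    get.asynch.par <- function(urls, cores = 2){
      ## download all pages concurrently, as above
      txt <- getURIAsynchronous(urls)
      ## parse the downloaded documents in parallel (fork-based, Unix-alikes only)
      mclapply(txt, function(x){
        doc <- htmlParse(x, asText = TRUE)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      }, mc.cores = cores)
    }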
    

    Here is some benchmarking for 100 URLs: the asynchronous version cuts the total download-and-parse time roughly in half.

    library(microbenchmark)
    uris <- c("http://www.omegahat.org/RCurl/index.html")
    urls <- replicate(100, uris)
    microbenchmark(get.asynch(urls), get.synch(urls), times = 1)
    
    Unit: seconds
                 expr      min       lq   median       uq      max neval
     get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
      get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
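
    The benchmark above passes all 100 URLs to getURIAsynchronous in a single call. For a truly huge URL set it may be safer to work in batches so you do not open too many connections at once; a hedged sketch reusing get.asynch (the helper name and the batch size of 50 are arbitrary assumptions):

    scrape.in.batches <- function(urls, batch.size = 50){
      ## split the URL vector into consecutive chunks of batch.size
      batches <- split(urls, ceiling(seq_along(urls) / batch.size))
      ## download and parse each chunk, then flatten the per-chunk results
      unlist(lapply(batches, get.asynch), recursive = FALSE)
    }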
    