Using parallelisation to scrape web pages with R


I am trying to scrape a large number of web pages in order to analyse them later. Since the number of URLs is huge, I decided to use the parallel package along with …

1 Answer

    You can use getURIAsynchronous from the RCurl package, which lets the caller specify multiple URIs to download at the same time.

    library(RCurl)
    library(XML)

    get.asynch <- function(urls){
      ## download all pages concurrently
      txt <- getURIAsynchronous(urls)
      ## this part can be easily parallelised (see the sketch after these
      ## two functions); I am just using lapply here as a first attempt
      res <- lapply(txt, function(x){
        doc <- htmlParse(x, asText = TRUE)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      })
      res
    }
    
    get.synch <- function(urls){
      lapply(urls, function(x){
        ## htmlParse() downloads and parses one URL at a time
        doc <- htmlParse(x)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      })
    }
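
    The comment in get.asynch hints at handing the parsing step to the parallel package mentioned in the question. Below is a minimal sketch of that idea, assuming a Unix-alike system where parallel::mclapply can fork (on Windows you would build a cluster and use parLapply instead); the function name get.asynch.par and the n.cores argument are purely illustrative:

    library(RCurl)
    library(XML)
    library(parallel)

    get.asynch.par <- function(urls, n.cores = 2){
      ## download all pages concurrently, as before
      txt <- getURIAsynchronous(urls)
      ## parse the downloaded HTML strings on n.cores forked workers
      ## (illustrative only; mc.cores > 1 is not supported on Windows)
      mclapply(txt, function(x){
        doc <- htmlParse(x, asText = TRUE)
        xpathSApply(doc, "/html/body/h2[2]", xmlValue)
      }, mc.cores = n.cores)
    }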
    

    Here is some benchmarking for 100 URLs: the asynchronous version roughly halves the total download-and-parse time.

    library(microbenchmark)
    uris <- "http://www.omegahat.org/RCurl/index.html"
    urls <- replicate(100, uris)
    microbenchmark(get.asynch(urls), get.synch(urls), times = 1)
    
    Unit: seconds
                 expr      min       lq   median       uq      max neval
     get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
      get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
    
