I am trying to scrape a large number of web pages in order to analyse them later. Since the number of URLs is huge, I decided to parallelise the work using the parallel package.
You can use getURIAsynchronous from the RCurl package, which allows the caller to specify multiple URIs to download at the same time.
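As a minimal illustration (the second URL is just a placeholder I added), getURIAsynchronous takes a vector of URIs, starts the downloads together, and returns one character string of page source per URI; that is why the parsing step further down calls htmlParse(x, asText = TRUE) instead of passing a URL:

library(RCurl)
uris <- c("http://www.omegahat.org/RCurl/index.html",
          "http://www.omegahat.org/RCurl/FAQ.html")  ## placeholder URL
txt <- getURIAsynchronous(uris)  ## downloads run concurrently
length(txt)                      ## 2 elements, one HTML source per URI

The two functions below compare this asynchronous approach with a plain synchronous one.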
library(RCurl)
library(XML)

get.asynch <- function(urls){
  ## download all pages concurrently; returns one page source per URL
  txt <- getURIAsynchronous(urls)
  ## this part can be easily parallelized (see the sketch below)
  ## I am just using lapply here as a first attempt
  res <- lapply(txt, function(x){
    doc <- htmlParse(x, asText = TRUE)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
  res
}
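The parsing step really can be run in parallel. Here is a minimal sketch, assuming a Unix-alike where parallel::mclapply can fork (on Windows you would need parLapply with a cluster instead); the name get.asynch.par and the core count are just illustrative choices:

library(parallel)

get.asynch.par <- function(urls, cores = max(1, detectCores() - 1)){
  ## download all pages concurrently, as before
  txt <- getURIAsynchronous(urls)
  ## parse the downloaded sources on several cores
  mclapply(txt, function(x){
    doc <- htmlParse(x, asText = TRUE)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  }, mc.cores = cores)
}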
get.synch <- function(urls){
  ## download and parse one page at a time
  lapply(urls, function(x){
    doc <- htmlParse(x)
    xpathSApply(doc, "/html/body/h2[2]", xmlValue)
  })
}
Here is some benchmarking for 100 URLs; the asynchronous version cuts the total download-and-parse time roughly in half.
library(microbenchmark)
uris <- c("http://www.omegahat.org/RCurl/index.html")
urls <- replicate(100, uris)
microbenchmark(get.asynch(urls), get.synch(urls), times = 1)
Unit: seconds
             expr      min       lq   median       uq      max neval
 get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
  get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
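One more note: with a really huge list of URLs you may not want to hand them all to getURIAsynchronous in a single call, since the downloads within one call are driven concurrently. A simple workaround is to process the URLs in batches; the helper name get.in.batches and the batch size of 50 below are my own illustrative choices, not anything prescribed by RCurl:

get.in.batches <- function(urls, batch.size = 50){
  ## split the URLs into chunks of at most batch.size
  batches <- split(urls, ceiling(seq_along(urls) / batch.size))
  ## download asynchronously within each batch, one batch at a time
  res <- lapply(batches, get.asynch)
  ## flatten the per-batch results into one list
  unlist(res, recursive = FALSE)
}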