Fast URL query with R

Submitted by 。_饼干妹妹 on 2019-12-21 05:10:43

Question


Hi, I have to query a website 10000 times, and I am looking for a really fast way to do it with R.

A template URL:

url <- "http://mutationassessor.org/?cm=var&var=7,55178574,G,A"

My code is:

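library(XML)     # readHTMLTable()
library(gtools)  # smartbind()

# First query: the table of interest is the 10th one on the page
url  <- mydata$mutationassessorurl[1]
rawurl  <- readHTMLTable(url)
Mutator  <- data.frame(rawurl[[10]])

# Remaining queries: fetch each page and append its table to Mutator
for(i in 2:27566) {
  url  <- mydata$mutationassessorurl[i]
  rawurl  <- readHTMLTable(url)
  Mutator  <- smartbind(Mutator, data.frame(rawurl[[10]]))
  print(i)  # progress indicator
}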

Using microbenchmark, I measure about 680 milliseconds per query. Is there a faster way to do it?

Thanks


Answer 1:


One way to speed up http connections is to leave the connection open between requests. The following example shows the difference it makes for httr. The first option is most similar to the default behaviour in RCurl.

library(httr)
test_server <- "http://had.co.nz"

# Return times in ms for easier comparison
timed_GET <- function(...) {
  req <- GET(...)
  round(req$times * 1000)
}

# Create a new handle for every request - no connection sharing
rowMeans(replicate(20, 
  timed_GET(handle = handle(test_server), path = "index.html")
))

##      redirect    namelookup       connect   pretransfer starttransfer 
##          0.00         20.65         75.30         75.40        133.20 
##         total 
##        135.05

test_handle <- handle(test_server)
# Reuse the same handle for multiple requests
rowMeans(replicate(20, 
  timed_GET(handle = test_handle, path = "index.html")
))

##      redirect    namelookup       connect   pretransfer starttransfer 
##          0.00          0.00          2.55          2.55         59.35 
##         total 
##         60.80

# With httr, handles are automatically pooled
rowMeans(replicate(20,
  timed_GET(test_server, path = "index.html")
))

##      redirect    namelookup       connect   pretransfer starttransfer 
##          0.00          0.00          2.55          2.55         57.75 
##         total 
##         59.40

Note the difference in the namelookup and connect times: if you're sharing a handle, each of these operations needs to happen only once, which saves quite a bit of time.
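To put a rough number on that saving, the mean totals printed above can be compared directly. This is only back-of-the-envelope arithmetic: it assumes the per-request difference measured against had.co.nz would carry over to another server, which it may not, and it uses the loop length from the question.

```r
# Mean total times in ms, copied from the outputs above
total_new_handle    <- 135.05  # new handle for every request
total_shared_handle <- 60.80   # one handle reused

# Number of requests in the question's loop
n_requests <- 27566

saved_per_request_ms <- total_new_handle - total_shared_handle
saved_total_min      <- saved_per_request_ms * n_requests / 1000 / 60

round(saved_per_request_ms, 2)  # 74.25 ms per request
round(saved_total_min, 1)       # 34.1 minutes over the whole run
```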

There's quite a lot of request-to-request variation, so on average the last two methods should be very similar.
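Applied to the loop in the question, the idea might look like the sketch below. It relies on httr pooling handles per host, as shown above, and hands the downloaded page to readHTMLTable for parsing. The helper names parse_table and fetch_table are made up for illustration, and the table position (10) is taken from the question. The sketch also collects the tables in a list and binds them once at the end, since growing a data frame inside a loop copies it on every iteration.

```r
library(httr)  # GET(), content(), stop_for_status()
library(XML)   # htmlParse(), readHTMLTable()

# Hypothetical helper: parse an HTML string and return the table at
# position `index` as a data frame.
parse_table <- function(html, index = 10) {
  doc <- htmlParse(html, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)[[index]]
}

# Hypothetical helper: download one page and extract its table.
# GET() reuses a pooled handle for the host, so the DNS lookup and
# connection setup are not repeated on every call.
fetch_table <- function(url, index = 10) {
  resp <- GET(url)
  stop_for_status(resp)  # fail early on HTTP errors
  parse_table(content(resp, as = "text", encoding = "UTF-8"), index)
}

# Usage, commented out because it needs network access and `mydata`:
# tables  <- lapply(mydata$mutationassessorurl, fetch_table)
# Mutator <- do.call(gtools::smartbind, tables)
```

How much this helps depends on how much of the 680 milliseconds is connection setup rather than server processing or HTML parsing, so it is worth timing a small batch before running the full set.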



Source: https://stackoverflow.com/questions/22940150/fast-url-query-with-r
