How to refresh or retry a specific web page using httr GET command?

和自甴很熟 submitted on 2019-12-08 19:41:44

Question


I need to access the same web page with different "keys" to get specific content it provides.

I have a list of keys, x, and I use the GET command from the httr package to access the web page and then retrieve the information I need into y.

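library(httr)
library(stringr)
library(XML)

y <- character(20)  # one result per key
for (i in 1:20) {
    # Request the page for key x[i]; give up after 10 seconds
    h1 <- GET(paste0("http:....categories=&query=", x[i]), timeout(10))
    # Parse the response body as HTML
    par <- htmlParse(content(h1, as = "text"), asText = TRUE)

    # Text of the links inside <h3> headings
    y[i] <- xpathSApply(doc = par, path = "//h3/a", fun = xmlValue)
}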

The problem is that timeout is often reached, and it disrupts the loop.

So I would like to refresh the web page or retry the GET command if timeout is reached, because I suspect the problem is with the internet connection of the website I am trying to access.

As the code stands, a timeout raises an error that stops the loop. I need to either ignore the error and go to the next iteration, or retry the request.


Answer 1:


Look at purrr::safely(). You can wrap GET as such:

safe_GET <- purrr::safely(GET)

This removes the ugliness of tryCatch() by letting you do:

resp <- safe_GET("http://example.com") # you can use all legal `GET` params

And you can test resp$result for NULL. Put that into your retry loop and you're good to go.

You can see this in action by doing:

str(safe_GET("https://httpbin.org/delay/3", timeout(1)))

which will ask the httpbin service to wait 3s before responding but set an explicit timeout on the GET request to 1s. I wrapped it in str() to show the result:

List of 2
 $ result: NULL
 $ error :List of 2
  ..$ message: chr "Timeout was reached"
  ..$ call   : language curl::curl_fetch_memory(url, handle = handle)
  ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

So, you can even check the message if you need to.
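
A retry loop built on this could look like the sketch below. The helper name get_with_retry, its max_tries and wait arguments, and the 10 second timeout are illustrative assumptions, not part of httr or purrr; the fetch argument exists only so the retry logic can be exercised without a network connection.

```r
library(httr)
library(purrr)

safe_GET <- safely(GET)

# Hypothetical helper: try a URL up to `max_tries` times.
# `fetch` defaults to safe_GET and must return a list with
# `result` and `error` elements, as purrr::safely() wrappers do.
get_with_retry <- function(url, max_tries = 3, wait = 1, fetch = safe_GET) {
  for (attempt in seq_len(max_tries)) {
    resp <- fetch(url, timeout(10))
    if (!is.null(resp$result)) {
      return(resp$result)            # success: hand back the response
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(resp$error))
    if (attempt < max_tries) Sys.sleep(wait)   # pause before retrying
  }
  NULL                               # every attempt failed
}
```

In the loop from the question you would call get_with_retry() in place of GET() and skip the key with next when it returns NULL.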




Answer 2:


http_status(h1) can help you find out where the problem lies:

a <- http_status(GET("http://google.com"))
a

$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"

and

b <- http_status(GET("http://google.com/blablablablaba"))
b

$category
[1] "Client error"

$reason
[1] "Not Found"

$message
[1] "Client error: (404) Not Found"

See a list of HTTP status codes to find out what the code you get means.
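
If you want to act on the status inside your loop, the category can be tested directly. The sketch below assumes that only server-side (5xx) responses are worth repeating, which may not suit every site; worth_retrying is a made-up name, not an httr function. Note that a timeout never produces a status at all, so this only covers requests that did get an answer.

```r
library(httr)

# Hypothetical helper: decide whether a completed request should be repeated.
# Accepts an httr response or a bare numeric status code.
worth_retrying <- function(x) {
  # Assumption: retry only when the server itself reported a failure (5xx)
  http_status(x)$category == "Server error"
}
```

For example, worth_retrying(h1) could gate a second call to GET for the same key.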

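Moreover, tryCatch can help you achieve what you want. Wrap the GET call itself (url below stands for the address you build in the loop) so that a timeout is caught instead of stopping the loop:

tryCatch({GET(url, timeout(10))}, error = function(e){print("error")})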
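
To ignore the error and move on to the next key, the handler can return NULL and the loop can test for it. This is a rough sketch: try_GET, fetch_all, build_url and extract are hypothetical names standing in for the URL construction and XPath step from the question, and the fetch argument is there only so the logic can be checked without a network connection.

```r
library(httr)

# Hypothetical helper: return the response, or NULL if the request
# raised an error such as a timeout.
try_GET <- function(url, ..., fetch = GET) {
  tryCatch(
    fetch(url, ...),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Sketch of the loop: leave NA for a key whose request failed and carry on.
fetch_all <- function(keys, build_url, extract, fetch = GET) {
  y <- rep(NA_character_, length(keys))
  for (i in seq_along(keys)) {
    resp <- try_GET(build_url(keys[i]), timeout(10), fetch = fetch)
    if (is.null(resp)) next          # ignore the error, go to the next key
    y[i] <- extract(resp)
  }
  y
}
```

Keys that failed stay NA in the result, so they can be collected and requested again later.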


Source: https://stackoverflow.com/questions/37367918/how-to-refresh-or-retry-a-specific-web-page-using-httr-get-command
