Question
I need to access the same web page with different "keys" to get specific content it provides.
I have a list of keys x and I use the GET command from the httr package to access the web page and then retrieve the information I need, y.
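library(httr)
library(stringr)
library(XML)
for (i in 1:20) {
  # request the page for the i-th key, giving up after 10 seconds
  h1 <- GET(paste0("http:....categories=&query=", x[i]), timeout(10))
  par <- htmlParse(file = h1)
  # extract the text of every link inside an <h3> heading
  y[i] <- xpathSApply(doc = par, path = "//h3/a", fun = xmlValue)
}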
The problem is that the timeout is often reached, and it disrupts the loop.
So I would like to refresh the web page or retry the GET command if the timeout is reached, because I suspect the problem is with the internet connection of the website I am trying to access.
The way my code works, a timeout breaks the loop. I need to either ignore the error and go to the next iteration or retry the request.
Answer 1:
Look at purrr::safely(). You can wrap GET as such:
safe_GET <- purrr::safely(GET)
This removes the ugliness of tryCatch() by letting you do:
resp <- safe_GET("http://example.com") # you can use all legal `GET` params
And you can test resp$result for NULL. Put that into your retry loop and you're good to go.
You can see this in action by doing:
str(safe_GET("https://httpbin.org/delay/3", timeout(1)))
which will ask the httpbin service to wait 3s before responding but set an explicit timeout on the GET request to 1s. I wrapped it in str() to show the result:
List of 2
 $ result: NULL
 $ error :List of 2
  ..$ message: chr "Timeout was reached"
  ..$ call   : language curl::curl_fetch_memory(url, handle = handle)
  ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
So, you can even check the message if you need to.
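The retry loop mentioned above could look roughly like the sketch below. The helper name get_with_retry and its max_tries, wait and fetch parameters are made up for illustration and are not part of httr or purrr; fetch exists only so the logic can be tried without a network connection.

```r
library(httr)
library(purrr)

# Wrap GET so that errors are captured instead of thrown.
safe_GET <- safely(GET)

# Hypothetical helper: try a request up to `max_tries` times.
# `fetch` defaults to safe_GET but can be any function returning
# a list with `result` and `error` elements.
get_with_retry <- function(url, ..., max_tries = 3, wait = 1, fetch = safe_GET) {
  for (attempt in seq_len(max_tries)) {
    resp <- fetch(url, ...)
    if (!is.null(resp$result)) {
      return(resp$result)  # success: hand back the response
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(resp$error))
    if (attempt < max_tries) Sys.sleep(wait)  # pause before the next try
  }
  NULL  # every attempt failed
}
```

Inside the loop from the question, the call would be something like h1 <- get_with_retry(paste0("http:....categories=&query=", x[i]), timeout(10)), followed by if (is.null(h1)) next to skip a key that never answered.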
Answer 2:
http_status(h1) can help you find out where the problem lies:
a <- http_status(GET("http://google.com"))
a
$category
[1] "Success"
$reason
[1] "OK"
$message
[1] "Success: (200) OK"
and
b <- http_status(GET("http://google.com/blablablablaba"))
b
$category
[1] "Client error"
$reason
[1] "Not Found"
$message
[1] "Client error: (404) Not Found"
See this list of HTTP status codes to know what the code you get means.
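Moreover, tryCatch can help you achieve what you want (url below stands for the address being requested; h1 in the question holds the response, not the address):
tryCatch({GET(url, timeout(10))}, error = function(e){print("error")})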
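To combine both ideas from this answer, one possible sketch is a small wrapper that returns NULL when the request throws (for example on a timeout) or when the server replies with an HTTP error status, so the loop can skip that key. The name try_get and the getter parameter are assumptions for illustration only; getter is there so the logic can be exercised offline.

```r
library(httr)

# Hypothetical helper: return the response, or NULL if the request
# errored (e.g. timed out) or came back with an HTTP error status.
try_get <- function(url, ..., getter = GET) {
  resp <- tryCatch(
    getter(url, ...),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(resp)) return(NULL)
  if (http_error(resp)) {
    # report the status, e.g. "Client error: (404) Not Found"
    message(http_status(resp)$message)
    return(NULL)
  }
  resp
}
```

In the question's loop this would be used as h1 <- try_get(paste0("http:....categories=&query=", x[i]), timeout(10)), then if (is.null(h1)) next. httr also provides RETRY(), e.g. RETRY("GET", url, timeout(10), times = 3), which may cover the retry part directly; it still raises an error once the final attempt fails, so a tryCatch around it is likely needed as well.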
Source: https://stackoverflow.com/questions/37367918/how-to-refresh-or-retry-a-specific-web-page-using-httr-get-command