Question
I have a bunch of URLs extracted by text-mining some PDF documents. Now I want to test the URLs for validity: some have junk characters inside or appended, and others are truncated. One approach is to filter them by requesting each one.
To do that, I use the url.exists() function from the RCurl package. The function makes HTTP HEAD requests to URLs using curl and checks the status code.
From the documentation (?url.exists):
If ‘.header’ is ‘FALSE’, this returns ‘TRUE’ or ‘FALSE’ indicating
whether the request was successful (had a status with a value in
the 200 range).
How can I make it return TRUE for URLs that issue a redirect? Redirect status codes are in the 300 range, and they are not really errors.
Or is there a better way, such as grabbing the actual status codes and processing them manually? Should I use a system command here?
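To make the "grab the status code and process it manually" idea concrete, here is a minimal sketch. It assumes that url.exists() called with .header = TRUE returns the parsed response header as a named character vector containing a "status" element, and a plain FALSE when the request itself fails. The helper names is_ok_status and url_reachable are hypothetical, not part of RCurl.

```r
# Treat any 2xx or 3xx status code as "the URL is reachable".
# Anything that cannot be read as an integer counts as not OK.
is_ok_status <- function(status) {
  status <- suppressWarnings(as.integer(status))
  !is.na(status) & status >= 200L & status < 400L
}

# Hypothetical wrapper around RCurl::url.exists (requires the RCurl package).
# Assumption: with .header = TRUE the result is the parsed header,
# a named character vector that includes a "status" entry.
url_reachable <- function(url) {
  header <- RCurl::url.exists(url, .header = TRUE)
  # A failed request (e.g. unresolvable host) yields a plain FALSE.
  if (is.logical(header)) return(FALSE)
  is_ok_status(header["status"])
}
```

Under those assumptions, something like vapply(urls, url_reachable, logical(1)) would filter a character vector of URLs. Keeping the status check in its own function makes it easy to change which codes count as valid.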
Source: https://stackoverflow.com/questions/15343560/rcurlurl-exists-how-to-get-non-error-for-redirects-in-the-300-range-of-ht