Question
I'm using Apache HttpComponents to GET web pages for a set of crawled URLs. Many of those URLs redirect to different URLs (e.g. because they have been processed with a URL shortener). In addition to downloading the content, I would like to resolve the final URL (i.e. the URL that actually served the downloaded content), or even better, all URLs in the redirect chain.
I have been looking through the API docs but could not find where to hook in. Any hints would be greatly appreciated.
Answer 1:
Here's a full demo of how to do it using Apache HttpComponents.
Important Details
You'll need to extend DefaultRedirectStrategy like so:
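import java.net.URI;
import java.util.Deque;
import java.util.LinkedList;
import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.ProtocolException;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.protocol.HttpContext;

class SpyStrategy extends DefaultRedirectStrategy {
    // push() adds at the head, so the deque is ordered newest-first.
    public final Deque<URI> history = new LinkedList<>();

    public SpyStrategy(URI uri) {
        history.push(uri);
    }

    @Override
    public HttpUriRequest getRedirect(
            HttpRequest request,
            HttpResponse response,
            HttpContext context) throws ProtocolException {
        HttpUriRequest redirect = super.getRedirect(request, response, context);
        // Record the target of every redirect the client decides to follow.
        history.push(redirect.getURI());
        return redirect;
    }
}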
The expand method sends a HEAD request, which causes the client to collect URIs in the spy.history deque as it follows redirects automatically:
public static Deque<URI> expand(String uri) {
    try {
        HttpHead head = new HttpHead(uri);
        SpyStrategy spy = new SpyStrategy(head.getURI());
        DefaultHttpClient client = new DefaultHttpClient();
        client.setRedirectStrategy(spy);
        // FIXME: the following completely ignores HTTP errors:
        client.execute(head);
        return spy.history;
    }
    catch (IOException e) {
        throw new RuntimeException(e);
    }
}
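One detail worth noting when reading the result: SpyStrategy records URIs with push, which adds at the head of the deque, so history comes back newest-first. The following is a small sketch, using only the JDK, of how the final URL and the chronological chain could be read from such a deque. The class name HistoryOrder and its two helper methods are hypothetical and not part of the original answer.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical helpers for reading a newest-first redirect history.
final class HistoryOrder {

    /** Returns the most recently pushed URI, i.e. the final URL (null if empty). */
    static URI finalUri(Deque<URI> history) {
        return history.peekFirst();
    }

    /** Returns the chain oldest-first: the original URI, then each redirect target. */
    static List<URI> chronological(Deque<URI> history) {
        List<URI> ordered = new ArrayList<>();
        // Walk from the tail (oldest entry) to the head (newest entry).
        Iterator<URI> it = history.descendingIterator();
        while (it.hasNext()) {
            ordered.add(it.next());
        }
        return ordered;
    }
}
```

With these helpers, HistoryOrder.finalUri(expand(url)) would give the URL that served the content, and HistoryOrder.chronological(expand(url)) the full chain in the order it was followed.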
You may want to cap the number of redirects followed at something reasonable (instead of the default of 100), like so:
BasicHttpParams params = new BasicHttpParams();
params.setIntParameter(ClientPNames.MAX_REDIRECTS, 5);
DefaultHttpClient client = new DefaultHttpClient(params);
Answer 2:
One way is to turn off automatic redirect handling by setting the relevant parameter (ClientPNames.HANDLE_REDIRECTS in HttpClient 4.x) to false, and do it yourself: check for 3xx responses and manually extract the redirect target from each response's "Location" header.
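The manual approach could be sketched roughly as follows. To keep the example self-contained, the HTTP call is abstracted behind a callback: locationOf is a hypothetical function that performs one request with automatic redirects disabled and returns the Location header value for a 3xx response, or empty otherwise. In practice it would wrap an HttpClient call. The class name RedirectChain, the method follow and the maxHops parameter are likewise assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of following redirects by hand.
final class RedirectChain {

    /**
     * Returns every URI visited, oldest-first, starting with {@code start}.
     *
     * @param locationOf assumed callback: issues one request WITHOUT automatic
     *                   redirects and returns the Location header value if the
     *                   response was a 3xx, otherwise Optional.empty()
     * @param maxHops    upper bound on the number of redirects followed
     */
    static List<URI> follow(URI start,
                            Function<URI, Optional<String>> locationOf,
                            int maxHops) {
        List<URI> chain = new ArrayList<>();
        URI current = start;
        chain.add(current);
        for (int hop = 0; hop < maxHops; hop++) {
            Optional<String> location = locationOf.apply(current);
            if (!location.isPresent()) {
                break; // not a redirect: current is the final URI
            }
            // Location may be relative, so resolve it against the current URI.
            URI next = current.resolve(location.get());
            if (chain.contains(next)) {
                break; // redirect loop: stop instead of cycling
            }
            chain.add(next);
            current = next;
        }
        return chain;
    }
}
```

For instance, if the callback reports that http://short.example/abc redirects to http://www.example.com/landing, which in turn answers with the relative location /final, follow returns those two URIs followed by http://www.example.com/final. Compared with the redirect-strategy approach, this keeps the chain under the caller's control at the cost of re-implementing redirect rules the client would otherwise handle.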
Source: https://stackoverflow.com/questions/11176486/getting-redirected-url-in-apache-httpcomponents