Getting redirected URL in Apache HttpComponents

Submitted by 吃可爱长大的小学妹 on 2019-12-24 23:29:41

Question


I'm using Apache HttpComponents to GET web pages for a set of crawled URLs. Many of those URLs redirect to different URLs (e.g. because they have been processed with a URL shortener). In addition to downloading the content, I would like to resolve the final URL (i.e. the URL that actually served the downloaded content) or, even better, all URLs in the redirect chain.

I have looked through the API docs but could not find where to hook in. Any hints would be greatly appreciated.


Answer 1:


Here's a full demo of how to do it using Apache HttpComponents.

Important Details

You'll need to extend DefaultRedirectStrategy like so:

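import java.net.URI;
import java.util.Deque;
import java.util.LinkedList;

import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.ProtocolException;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.protocol.HttpContext;

// Records the original URI plus every URI the client is redirected to.
class SpyStrategy extends DefaultRedirectStrategy {
    // push() adds at the head, so the most recent URI is always first.
    public final Deque<URI> history = new LinkedList<>();

    public SpyStrategy(URI uri) {
        history.push(uri);
    }

    @Override
    public HttpUriRequest getRedirect(
            HttpRequest request,
            HttpResponse response,
            HttpContext context) throws ProtocolException {
        HttpUriRequest redirect = super.getRedirect(request, response, context);
        history.push(redirect.getURI());
        return redirect;
    }
}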

The expand method sends a HEAD request. Because the client follows redirects automatically, it collects each URI in the spy.history deque along the way:

public static Deque<URI> expand(String uri) {
    try {
        HttpHead head = new HttpHead(uri);
        SpyStrategy spy = new SpyStrategy(head.getURI());
        DefaultHttpClient client = new DefaultHttpClient();
        client.setRedirectStrategy(spy);
        // FIXME: the following completely ignores HTTP errors:
        client.execute(head);
        return spy.history;
    }
    catch (IOException e) {
        throw new RuntimeException(e);
    }
}
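
Because SpyStrategy uses push(), the deque returned by expand holds the most recent URI first. The following is a minimal sketch of how the result might be read, using only the JDK; the class name HistoryOrder, the helper inRequestOrder, and the example URIs are made up for illustration and are not part of the original answer.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper, not part of HttpComponents or the original answer.
class HistoryOrder {
    /** Returns the chain in request order: original URI first, final URI last. */
    static List<URI> inRequestOrder(Deque<URI> history) {
        List<URI> ordered = new ArrayList<>();
        // push() adds at the head, so walk from the tail to recover request order.
        for (Iterator<URI> it = history.descendingIterator(); it.hasNext(); ) {
            ordered.add(it.next());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Stand-in for the deque returned by expand(); these URIs are invented.
        Deque<URI> history = new LinkedList<>();
        history.push(URI.create("http://short.example/a"));
        history.push(URI.create("http://example.org/b"));
        history.push(URI.create("http://example.org/c"));

        // The head of the deque is the last URI pushed, i.e. the final URL.
        System.out.println("final URL: " + history.peekFirst());
        System.out.println("chain:     " + inRequestOrder(history));
    }
}
```

In other words, history.peekFirst() gives the final URL, and iterating from the tail gives the full redirect chain in the order the requests were made.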

You may want to lower the maximum number of redirects followed to something reasonable (the default is 100), for example by constructing the client with these parameters:

BasicHttpParams params = new BasicHttpParams();
params.setIntParameter(ClientPNames.MAX_REDIRECTS, 5);
DefaultHttpClient client = new DefaultHttpClient(params);



Answer 2:


One way is to turn off automatic redirect handling by setting the relevant parameter and to follow redirects yourself: check each response for a 3xx status and extract the redirect target from the response's "Location" header.
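
The manual approach might look roughly like the sketch below. To keep it runnable without a network or the HttpComponents jar, the HTTP call is abstracted behind a Function; the names ManualRedirects, Hop, and follow, as well as the example URIs, are hypothetical. With a real client you would disable automatic redirects (in HttpClient 4.x that is, as far as I know, ClientPNames.HANDLE_REDIRECTS) and build each Hop from the response status and its "Location" header.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of manual redirect following; not an HttpComponents API.
class ManualRedirects {
    /** One response, reduced to the two fields the loop needs. */
    static final class Hop {
        final int status;
        final String location; // value of the Location header, or null if absent
        Hop(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** True for redirect statuses that carry a Location header. */
    static boolean isRedirect(Hop hop) {
        switch (hop.status) {
            case 301: case 302: case 303: case 307: case 308:
                return hop.location != null;
            default:
                return false;
        }
    }

    /** Follows redirects by hand and returns every URI visited, in request order. */
    static List<URI> follow(URI start, Function<URI, Hop> fetch, int maxRedirects) {
        List<URI> chain = new ArrayList<>();
        URI current = start;
        chain.add(current);
        while (true) {
            Hop hop = fetch.apply(current);
            if (!isRedirect(hop)) {
                return chain; // the last element is the final URL
            }
            // chain.size() - 1 redirects have been followed so far.
            if (chain.size() - 1 >= maxRedirects) {
                throw new IllegalStateException("More than " + maxRedirects + " redirects");
            }
            // Location may be relative, so resolve it against the current URI.
            current = current.resolve(hop.location);
            chain.add(current);
        }
    }

    public static void main(String[] args) {
        // Canned responses standing in for real HTTP requests.
        Map<URI, Hop> responses = new HashMap<>();
        responses.put(URI.create("http://short.example/a"), new Hop(301, "http://example.org/b"));
        responses.put(URI.create("http://example.org/b"), new Hop(302, "/c"));

        List<URI> chain = follow(
                URI.create("http://short.example/a"),
                uri -> responses.getOrDefault(uri, new Hop(200, null)),
                5);
        System.out.println(chain);
    }
}
```

The explicit limit plays the same role as MAX_REDIRECTS in the first answer: without it, a redirect loop would never terminate.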



Source: https://stackoverflow.com/questions/11176486/getting-redirected-url-in-apache-httpcomponents
