URLConnection is not allowing me to access data on HTTP errors (404, 500, etc.)

夕颜 2021-02-01 18:33

I am making a crawler, and need to get the data from the stream regardless of whether it is a 200 or not. cURL does this, as does any standard browser.


2 Answers
  • 2021-02-01 18:43

    Simple:

    URLConnection connection = url.openConnection();
    InputStream is;
    if (connection instanceof HttpURLConnection) {
        HttpURLConnection httpConn = (HttpURLConnection) connection;
        // getResponseCode() does not throw on HTTP error statuses,
        // so check it before deciding which stream to read.
        int statusCode = httpConn.getResponseCode();
        if (statusCode >= 200 && statusCode < 300) {
            is = httpConn.getInputStream();
        } else {
            is = httpConn.getErrorStream();
        }
    } else {
        is = connection.getInputStream();
    }
    

    See the Javadoc for the full explanation. The way I would handle this is as follows:

    URLConnection connection = url.openConnection();
    InputStream is = null;
    try {
        is = connection.getInputStream();
    } catch (IOException ioe) {
        // getInputStream() throws on HTTP error statuses (4xx/5xx);
        // fall back to the error body in that case.
        if (connection instanceof HttpURLConnection) {
            HttpURLConnection httpConn = (HttpURLConnection) connection;
            int statusCode = httpConn.getResponseCode();
            if (statusCode < 200 || statusCode >= 300) {
                is = httpConn.getErrorStream(); // may be null if no error body
            }
        }
    }
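
    For completeness, here is one way to drain whichever stream you end up with. This is a minimal sketch of mine, not part of the original answer: the Scanner idiom and the null guard are assumptions, and getErrorStream() can legitimately return null when the server sent no body.

    // Drain the stream into a String; 'is' may be null if there was no body.
    String body = "";
    if (is != null) {
        try (java.util.Scanner scanner =
                new java.util.Scanner(is, "UTF-8").useDelimiter("\\A")) {
            body = scanner.hasNext() ? scanner.next() : "";
        }
    }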
    
  • 2021-02-01 19:10

    You need to do the following after calling openConnection.

    1. Cast the URLConnection to HttpURLConnection

    2. Call getResponseCode

    3. If the response is a success, use getInputStream, otherwise use getErrorStream

    (The test for success should be 200 <= code < 300, because there are valid HTTP success codes apart from 200.)
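
    A minimal sketch of those three steps (the class name, URL, and surrounding main method are placeholders of mine, not from the answer):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchPage {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://example.com/page"); // hypothetical URL
            // 1. Cast the URLConnection to HttpURLConnection
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // 2. Call getResponseCode
            int code = conn.getResponseCode();
            // 3. Success (200 <= code < 300) -> body stream; otherwise -> error stream
            InputStream is = (code >= 200 && code < 300)
                    ? conn.getInputStream()
                    : conn.getErrorStream();
            // ... read 'is' here; for errors it may be null if there was no body ...
            conn.disconnect();
        }
    }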


    I am making a crawler, and need to get the data from the stream regardless if it is a 200 or not.

    Just be aware that if the code is a 4xx or 5xx, then the "data" is likely to be an error page of some kind.


    One final point: you should always respect the site's "robots.txt" file, and read the Terms of Service before crawling / scraping the content of a site whose owners might care. Simply blatting off GET requests is likely to annoy site owners ... unless you've already come to some sort of "arrangement" with them.
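
    If it helps, here is a deliberately naive illustration of checking robots.txt before fetching a path. This is my sketch, not part of the answer: it only honours Disallow lines under "User-agent: *", and a real crawler should use a proper robots.txt parser.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RobotsCheck {
        // Naive check: is 'path' blocked by a Disallow rule under "User-agent: *"?
        // Sketch only; does not handle Allow rules, wildcards, or crawl delays.
        static boolean isAllowed(String host, String path) throws Exception {
            URL robots = new URL("https://" + host + "/robots.txt");
            HttpURLConnection conn = (HttpURLConnection) robots.openConnection();
            if (conn.getResponseCode() != 200) {
                return true; // no readable robots.txt -> assume allowed
            }
            boolean appliesToAll = false;
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
                        appliesToAll = line.substring(11).trim().equals("*");
                    } else if (appliesToAll
                            && line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                        String rule = line.substring(9).trim();
                        if (!rule.isEmpty() && path.startsWith(rule)) {
                            return false;
                        }
                    }
                }
            }
            return true;
        }
    }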
