URLConnection cannot retrieve complete HTML

三世轮回 submitted on 2019-12-24 06:58:09

Question


I am trying to parse information from a website. However, it works only when the content is not very long. As the HTML grows larger, the loaded content is incomplete. The retrieved string is around 40000 characters long, and its length differs on each attempt (for example, 31345 the first time and 31358 the next), so I cannot retrieve the full page.

As a result, I assume this problem is related to the internet connection or to buffering. But I am using a BufferedReader, and as far as I know HttpURLConnection works like a stream, so there should not be any problem. I have checked almost every page related to URLConnection, but none of them discusses this.

Is there anything wrong with my code? I have been working on this problem for a few days. Any advice would be very helpful. Thanks in advance.

public String getHtmlFromUrl(String url, int startReadingLine) {
    String xml = "";

    try {

        //URL url1 = new URL(url);
        URL url1 = new URL("http://support.google.com/analytics/bin/answer.py?hl=zh-Hant&answer=1009602");

        HttpURLConnection urlConn = (HttpURLConnection) url1
                .openConnection();

        urlConn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1;zh-tw; MSIE 6.0)");
        if (Integer.parseInt(Build.VERSION.SDK) < Build.VERSION_CODES.FROYO) {
            System.setProperty("http.keepAlive", "false");
        }
        urlConn.setReadTimeout(10000 /* milliseconds */);
        urlConn.setConnectTimeout(15000 /* milliseconds */);
        urlConn.setDoOutput(true);
        urlConn.setDoInput(true);
        urlConn.setRequestMethod("GET");
        urlConn.setUseCaches(false);


        InputStreamReader in = new InputStreamReader(
                urlConn.getInputStream());
        BufferedReader buffer = new BufferedReader(in, 100000);

        StringBuilder builder = new StringBuilder();
        String aux = "";



        while ((aux = buffer.readLine()) != null)
            builder.append(aux);

        xml = builder.toString();

        in.close();
        urlConn.disconnect();

    } catch (SocketTimeoutException e) {
        return "time out";
    } catch (IOException e) {
        e.printStackTrace();
    }
    // return XML
    return xml;
}

Here is an example of the retrieved XML (length 40710):

(I did not add the "..." at the end of the XML myself.)

<!DOCTYPE html><html lang="zh-Hant"class="streamlined streamlined-3"><head><script type="text/javascript">serverResponseTimeDelta=window.external&&window.external.pageT?window.external.pageT:-1;pageStartTime=new Date().getTime...

   ...

 ..."納米比亞", "NR": "諾魯", "NP": "尼泊爾", "NL": "荷蘭", "AN": "荷屬安地列斯", "KN": "尼維斯", "NC": "新喀里多尼亞", "NI": "尼加拉瓜", "NE": "尼日", "NG": "奈及利亞", "NU": "紐埃", "KR": "北韓", "NO": "挪威", "NZ": "紐西蘭", "OM": "阿曼", "PW": "帛琉", "PK": "巴基斯坦", "PS": "巴勒斯坦", "PA": "巴拿馬", "PG": "巴布亞新幾內亞", "PY": "巴拉圭", "PE": "秘魯", "PH"...

Another example (length 41106):

<!DOCTYPE html><html lang="zh-Hant"class="streamlined streamlined-3"><head><script type="text/javascript">serverResponseTimeDelta=window.external&&window.external.pageT?window.externa...

    ...

...屬安地列斯", "KN": "尼維斯", "NC": "新喀里多尼亞", "NI": "尼加拉瓜", "NE": "尼日", "NG": "奈及利亞", "NU": "紐埃", "KR": "北韓", "NO": "挪威", "NZ": "紐西蘭", "OM": "阿曼", "PW": "帛琉", "PK": "巴基斯坦", "PS": "巴勒斯坦", "PA": "巴拿馬", "PG": "巴布亞新幾內亞", "PY": "巴拉圭", "PE": "秘魯", "PH"...

Edit: So far I assume it has something to do with the way the code interacts with the network, since the length of each result is different, or it could be some weird bug on my device. The root cause is yet to be found. The weirdest part is that the result ends with "...". It appears to know that the result is not complete yet...


Answer 1:


Always try writing your input to an external file and look at what you actually received! I had the same problem on Android too. In the end, it turned out that logcat did not show me the whole string!
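
One way to act on this, as a minimal sketch: if logcat cuts off long messages, the string can be logged in pieces and its length checked separately. The class name `LogChunker`, the method `chunk`, and the chunk size are hypothetical names chosen for illustration. The per-entry logcat limit is commonly described as roughly 4 KB, which is an assumption here and not a documented guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a long string into pieces small enough
// to be logged one entry at a time.
class LogChunker {

    // Returns consecutive substrings of at most maxLen characters each.
    static List<String> chunk(String text, int maxLen) {
        if (maxLen <= 0) {
            throw new IllegalArgumentException("maxLen must be positive");
        }
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxLen) {
            int end = Math.min(text.length(), start + maxLen);
            parts.add(text.substring(start, end));
        }
        return parts;
    }
}
```

On Android, each piece could then be passed to `Log.d(TAG, part)`, and `xml.length()` logged on its own line to confirm how much of the page actually arrived.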




Answer 2:


You can try the code below.

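// `in` is assumed to be the raw InputStream from urlConn.getInputStream()
BufferedInputStream bis = new BufferedInputStream(in);
ByteArrayOutputStream buf = new ByteArrayOutputStream();
int result = bis.read();
while (result != -1) {
  buf.write(result);
  result = bis.read();
}
// Decode explicitly; UTF-8 is an assumption about the page's encoding
return buf.toString("UTF-8");

Alternatively: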

    Writer writer = new StringWriter();

    char[] buffer = new char[1024];
    try {
        Reader reader = new BufferedReader(
                new InputStreamReader(is, "UTF-8"));
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
    } finally {
        is.close();
    }
    return writer.toString();

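The last method, which I currently use, is:

    InputStream is = null;
    StringBuilder outData = new StringBuilder();
    try {
        URL u = new URL(url);
        is = u.openStream();
        // DataInputStream.readLine() is deprecated and does not decode bytes
        // into characters correctly, so a BufferedReader is used instead.
        // UTF-8 is an assumption about the page's encoding.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, "UTF-8"));
        String app;
        while ((app = reader.readLine()) != null) {
            outData.append(app);
        }
    } catch (MalformedURLException ex) {
        Log.e(TAG, "Malformed URL Exception", ex);
        return null;
    } catch (IOException ex) {
        Log.e(TAG, "Error stream ", ex);
        return null;
    } finally {
        // openStream() may have failed, leaving `is` null
        if (is != null) {
            try {
                is.close();
            } catch (IOException ioe) {
                // Nothing useful can be done if close fails
            }
        }
    }
    return outData.toString();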
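
For checking any of these reading loops off the device, here is a self-contained sketch in plain Java. The class name `StreamText` and the method `readAll` are hypothetical, and UTF-8 in the usage note is an assumption about the page's encoding. It reads in fixed-size character chunks, so line breaks are preserved and multi-byte characters are decoded by the reader and not byte by byte.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper: reads an entire stream into a String.
class StreamText {

    // Reads until end of stream, decoding bytes with the given charset.
    // The stream is closed when reading finishes or fails.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        }
        return out.toString();
    }
}
```

With a connection like the one in the question, the call could look like `StreamText.readAll(urlConn.getInputStream(), StandardCharsets.UTF_8)`. Comparing the returned string's `length()` across runs shows whether the read itself is incomplete or only the displayed output is.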


Source: https://stackoverflow.com/questions/14455986/urlconnection-cannot-retrive-complete-html
