Strange encoding behaviour with jsoup

渐次进展 2021-01-06 16:14

I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a s

1 Answer
  • 2021-01-06 16:54

    This is a mistake of the website itself. Actually, there are three mistakes:

    1. The page is served without any charset in the HTTP Content-Type response header. The HTML meta tag does declare ISO-8859-1, but this is ignored when the page is served over HTTP! The average web browser will then either try smart detection or fall back to the platform default encoding to decode the page, which is CP1252 on Windows machines.

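    2. The <meta> tag claims that the content is ISO-8859-1 encoded, but the actual character (U+2013 EN DASH) is not covered by that charset at all. It is, however, covered by CP1252, where it is encoded as the single byte 0x96. Decoded as ISO-8859-1, that same byte becomes the invisible control character U+0096 instead.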

    3. According to the page's source code, the product name uses the literal character instead of the HTML entity &ndash;, which is used elsewhere on the same page.
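
    The mismatch described in point 2 can be checked with a minimal sketch that uses only the JDK. The class name DashDecoding and the helper decodeByte are illustrative and not part of the original answer; the sketch also assumes the JDK ships the windows-1252 charset, which is Java's canonical name for CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DashDecoding {
    // Hypothetical helper: decode a single byte with the given charset
    // and return the resulting Unicode code point.
    static int decodeByte(int b, Charset charset) {
        String s = new String(new byte[] { (byte) b }, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this JDK provides windows-1252 (CP1252).
        Charset cp1252 = Charset.forName("windows-1252");

        // The same byte, 0x96, yields two different characters.
        System.out.printf("CP1252:     U+%04X%n", decodeByte(0x96, cp1252));
        System.out.printf("ISO-8859-1: U+%04X%n",
                decodeByte(0x96, StandardCharsets.ISO_8859_1));
    }
}
```

    Run as is, this should print U+2013 for CP1252 and U+0096 for ISO-8859-1, which is why the dash only survives when the bytes are decoded as CP1252.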

    Jsoup can transparently fix many badly developed webpages, but this one is beyond what Jsoup can handle on its own. You need to read the page in manually and then feed it to Jsoup as CP1252.

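    // Needs: java.io.InputStream, java.net.URL, org.jsoup.Jsoup, org.jsoup.nodes.Document
    String url = "http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html";
    try (InputStream input = new URL(url).openStream()) {
        // Decode the raw bytes as CP1252 instead of trusting the page's declared charset.
        Document doc = Jsoup.parse(input, "CP1252", url);
        String title = doc.select(".products_name").first().text();
        // ...
    }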
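
    If only an already-decoded string is available, for example text that was read as ISO-8859-1 earlier in a pipeline, a similar repair may be possible without refetching the page. The following is a sketch under assumptions, not part of the original answer: the class Cp1252Repair, the helper repair, and the sample product name are all made up for illustration. It relies on ISO-8859-1 mapping every byte to exactly one character, so the original bytes can be recovered and decoded again as CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Repair {
    // Assumption: this JDK provides windows-1252 (CP1252).
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: undo an ISO-8859-1 decode and redo it as CP1252.
    // Only safe when every character in the input is at or below U+00FF;
    // anything above that is replaced with '?' by getBytes.
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        // Illustrative input: U+0096 is what byte 0x96 becomes under ISO-8859-1.
        String broken = "Armbanduhr \u0096 Metall";
        String fixed = repair(broken);
        System.out.println(fixed.equals("Armbanduhr \u2013 Metall"));
    }
}
```

    Parsing the raw stream as CP1252, as shown in the answer, remains the more direct route; this variant is only useful when the bytes are no longer at hand.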
    