Java : How to determine the correct charset encoding of a stream

花落未央 2020-11-22 02:06

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct cha

15 answers
  • 2020-11-22 02:29

    As far as I know, no general-purpose library in this area suits every kind of problem. For each problem you should test the existing libraries and pick the one that best satisfies your constraints, but often none of them is appropriate. In those cases you can write your own encoding detector, as I have written ...

    I’ve written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as built-in components. Here you can find my tool; please read the README section before anything else. You can also find some basic concepts of this problem in my paper and in its references.

    Below are some helpful observations from my work:

    • Charset detection is not a foolproof process, because it is essentially based on statistical data; what actually happens is guessing, not detecting
    • icu4j, by IBM, is the main tool in this area, imho
    • Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and in my tests their accuracy did not differ meaningfully from that of icu4j itself (by at most 1%, as I remember)
    • icu4j is much more general than jchardet; icu4j is only slightly biased toward IBM-family encodings, while jchardet is strongly biased toward UTF-8
    • Because UTF-8 is so widespread in the HTML world, jchardet is overall a better choice than icu4j, but it is still not the best choice!
    • icu4j is great for East Asian encodings such as EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family of encodings
    • Both icu4j and jchardet perform very poorly on HTML pages encoded in Windows-1251 and Windows-1256. Windows-1251 (aka cp1251) is widely used for Cyrillic-based languages such as Russian, and Windows-1256 (aka cp1256) is widely used for Arabic
    • Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and content of the input
    • Some encodings are essentially the same apart from a few differences, so in some cases the guessed or detected encoding may be nominally wrong yet still decode the text correctly, as with Windows-1252 and ISO-8859-1 (refer to the last paragraph of section 5.2 of my paper)
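
The last point, about Windows-1252 and ISO-8859-1, can be illustrated with a minimal sketch in plain Java (no detection library involved). The class and method names are made up for this example; it only relies on the fact that the two charsets agree on most byte values and differ in the 0x80 to 0x9F range.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative class name; not part of icu4j, jchardet, or any tool mentioned above.
class CharsetOverlapDemo {

    static final Charset LATIN1 = StandardCharsets.ISO_8859_1;
    // Looked up by name because it is not one of the StandardCharsets constants.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the bytes decode to the same text under both charsets. */
    static boolean decodesIdentically(byte[] data) {
        return new String(data, LATIN1).equals(new String(data, CP1252));
    }

    public static void main(String[] args) {
        // "café": 0xE9 is 'é' in both charsets.
        byte[] accented = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        // 0x80 is the euro sign in windows-1252 but a control character in ISO-8859-1.
        byte[] euro = { (byte) 0x80, 0x35 };

        System.out.println(decodesIdentically(accented)); // true
        System.out.println(decodesIdentically(euro));     // false
    }
}
```

For input like the first array, a detector that answers ISO-8859-1 when the file was really written as Windows-1252 still yields the correct text, which is why such a guess can be "false but at the same time true".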
  • 2020-11-22 02:35

    Check this out: http://site.icu-project.org/ (icu4j). It has libraries for detecting the charset of an input stream; it can be as simple as this:

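    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import java.io.BufferedInputStream;
    import java.nio.charset.UnsupportedCharsetException;

    // input is the InputStream to inspect; reader and charset are declared elsewhere.
    // The detector needs a stream that supports mark/reset, hence the buffering.
    BufferedInputStream bis = new BufferedInputStream(input);
    CharsetDetector cd = new CharsetDetector();
    cd.setText(bis);
    CharsetMatch cm = cd.detect(); // best guess, or null if nothing matched

    if (cm != null) {
        reader = cm.getReader();
        charset = cm.getName();
    } else {
        // The constructor requires a charset name; none is known here.
        throw new UnsupportedCharsetException("unknown");
    }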
    
  • 2020-11-22 02:35

    In plain Java:

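    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.util.List;

    final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };

    List<String> lines;

    // path is the java.nio.file.Path of the file to read
    for (String encoding : encodings) {
        try {
            // Throws MalformedInputException (an IOException) on bytes invalid for this charset
            lines = Files.readAllLines(path, Charset.forName(encoding));
            for (String line : lines) {
                // do something...
            }
            break; // decoding succeeded, stop trying
        } catch (IOException ioe) {
            System.out.println(encoding + " failed, trying next.");
        }
    }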
    

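    This approach tries the encodings one by one until one of them decodes the file without an error or we run out of them. Keep in mind that decoding without an error is not proof that the encoding is right: ISO-8859-1 maps every byte to a character, so it never fails and the entries listed after it are never reached. (BTW, my encodings list has only those items because they are the charset implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)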
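
The same try-in-order idea can be sketched on a byte array with a strict `CharsetDecoder`, which makes it easy to see how much the result depends on the order of the candidates. This is only a sketch: the class and method names below are hypothetical, and the candidate lists are chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper class, written only for this illustration.
class StrictDecodeDemo {

    /** True if the whole array decodes under cs without malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** First candidate, in the given order, that decodes the data without error. */
    static Optional<Charset> firstDecodable(byte[] data, Charset... candidates) {
        for (Charset cs : candidates) {
            if (decodesCleanly(data, cs)) {
                return Optional.of(cs);
            }
        }
        return Optional.empty();
    }

    static String describe(Optional<Charset> result) {
        return result.map(Charset::name).orElse("none");
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);

        // Strict charsets first, the accept-everything charset last.
        System.out.println(describe(firstDecodable(utf8Bytes,
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));   // UTF-8
        System.out.println(describe(firstDecodable(latin1Bytes,
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));   // ISO-8859-1

        // With ISO-8859-1 ahead of UTF-8, the UTF-8 data is claimed by ISO-8859-1.
        System.out.println(describe(firstDecodable(utf8Bytes,
                StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));   // ISO-8859-1
    }
}
```

Putting the stricter charsets first gives them a chance to reject the data before a permissive single-byte charset accepts it, but a clean decode remains a guess rather than a detection.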
