Java : How to determine the correct charset encoding of a stream


With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct charset encoding of an input stream or file?

15 Answers
  • 2020-11-22 02:22

    You cannot determine the encoding of an arbitrary byte stream; that is the nature of encodings. An encoding is a mapping between byte values and their character representations, so every encoding "could" be the right one.

    The getEncoding() method returns the encoding that was set up for the stream (read the JavaDoc); it will not guess the encoding for you.

    Some formats declare the encoding that was used to create them (XML, HTML), but an arbitrary byte stream does not.

    Anyway, you could try to guess an encoding on your own if you have to. Every language has a characteristic frequency for each character: in English the character e appears very often, while ê appears very rarely. An ISO-8859-1 stream usually contains no 0x00 bytes, but a UTF-16 stream contains a lot of them.
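
    As an illustration of the 0x00 heuristic, here is a minimal sketch (the one-third threshold is an arbitrary assumption, not a standard value):

    static boolean looksLikeUtf16(byte[] data) {
        int zeroBytes = 0;
        for (byte b : data) {
            if (b == 0x00) {
                zeroBytes++;
            }
        }
        // ISO-8859-1 text virtually never contains NUL bytes, while UTF-16
        // encodes most Latin characters with a zero high byte.
        return zeroBytes > data.length / 3;
    }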

    Or you could ask the user. I have seen applications that present a snippet of the file in different encodings and ask you to select the "correct" one.

  • 2020-11-22 02:22

    You can pick the appropriate charset in the constructor:

    new InputStreamReader(new FileInputStream(in), "ISO8859_1");
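
    On Java 7 and later, a slightly safer variant (a sketch; the file name is illustrative) passes a Charset constant instead of a string literal, which removes the risk of an UnsupportedEncodingException from a mistyped name:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;

    try (Reader reader = new InputStreamReader(
            new FileInputStream("in.txt"), StandardCharsets.ISO_8859_1)) {
        int c;
        while ((c = reader.read()) != -1) {
            System.out.print((char) c);  // characters already decoded as ISO-8859-1
        }
    }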
    
  • 2020-11-22 02:23

    You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
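
    A minimal sketch of that validation approach (the method name is hypothetical):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    static boolean decodesCleanly(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;   // data is valid in this charset (not necessarily correct)
        } catch (CharacterCodingException e) {
            return false;  // this charset is definitely wrong for these bytes
        }
    }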

  • 2020-11-22 02:24

    I found a nice third-party library that can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

    I didn't test it extensively, but it seems to work.
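
    Basic usage looks roughly like this (a sketch based on the library's CharsetToolkit class, which the longer answer below also uses):

    import java.nio.charset.Charset;
    import com.glaforge.i18n.io.CharsetToolkit;

    static Charset guess(byte[] data) {
        // Inspects the buffer (BOM, UTF-8 byte patterns, ...) and falls back
        // to a default charset when the content is ambiguous.
        return new CharsetToolkit(data).guessEncoding();
    }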

  • 2020-11-22 02:24

    There is no easy way to distinguish ISO8859_1 files from ASCII. For Unicode files, however, one can generally detect the encoding based on the first few bytes of the file.

    UTF-16 files normally begin with a Byte Order Mark (BOM), and UTF-8 files sometimes do. The BOM is the zero-width no-break space character, U+FEFF, encoded in the file's own encoding.

    Unfortunately, for historical reasons, Java does not detect the BOM automatically. Programs like Notepad check the BOM and use the appropriate encoding. On Unix or under Cygwin, you can check for a BOM with the file command. For example:

    $ file sample2.sql 
    sample2.sql: Unicode text, UTF-16, big-endian
    

    For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding
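
    For a sense of what such detection involves, here is a minimal BOM-sniffing sketch (the helper name is hypothetical; it covers only the common UTF-8 and UTF-16 marks):

    // Returns the charset indicated by a BOM, or null if none is present.
    static String charsetFromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }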

  • 2020-11-22 02:28

    Which library to use?

    As of this writing, there are three libraries that stand out:

    • GuessEncoding
    • ICU4j
    • juniversalchardet

    I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.

    How to tell which one has detected the right charset (or as close as possible)?

    It's impossible to certify the charset detected by each of the above libraries. However, it's possible to ask them in turn and score the returned responses.

    How to score the returned response?

    Each response can be assigned one point. The more points a response has, the more confident you can be in the detected charset. This is a simple scoring method; you can devise more elaborate ones.

    Is there any sample code?

    Here is a full snippet implementing the strategy described above.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;

    import com.glaforge.i18n.io.CharsetToolkit;
    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import org.mozilla.universalchardet.UniversalDetector;

    public static String guessEncoding(InputStream input) throws IOException {
        // Load input data
        long count = 0;
        int n = 0, EOF = -1;
        byte[] buffer = new byte[4096];
        ByteArrayOutputStream output = new ByteArrayOutputStream();
    
        while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
            output.write(buffer, 0, n);
            count += n;
        }
        
        if (count > Integer.MAX_VALUE) {
            throw new RuntimeException("Inputstream too large.");
        }
    
        byte[] data = output.toByteArray();
    
        // Detect encoding
        Map<String, int[]> encodingsScores = new HashMap<>();
    
        // * GuessEncoding
        updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());
    
        // * ICU4j
        CharsetDetector charsetDetector = new CharsetDetector();
        charsetDetector.setText(data);
        charsetDetector.enableInputFilter(true);
        CharsetMatch cm = charsetDetector.detect();
        if (cm != null) {
            updateEncodingsScores(encodingsScores, cm.getName());
        }
    
        // * juniversalchardset
        UniversalDetector universalDetector = new UniversalDetector(null);
        universalDetector.handleData(data, 0, data.length);
        universalDetector.dataEnd();
        String encodingName = universalDetector.getDetectedCharset();
        if (encodingName != null) {
            updateEncodingsScores(encodingsScores, encodingName);
        }
    
        // Find winning encoding
        Map.Entry<String, int[]> maxEntry = null;
        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
                maxEntry = e;
            }
        }
    
        String winningEncoding = maxEntry.getKey();
        //dumpEncodingsScores(encodingsScores);
        return winningEncoding;
    }
    
    private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
        String encodingName = encoding.toLowerCase();
        int[] encodingScore = encodingsScores.get(encodingName);
    
        if (encodingScore == null) {
            encodingsScores.put(encodingName, new int[] { 1 });
        } else {
            encodingScore[0]++;
        }
    }    
    
    private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
        System.out.println(toString(encodingsScores));
    }
    
    private static String toString(Map<String, int[]> encodingsScores) {
        String GLUE = ", ";
        StringBuilder sb = new StringBuilder();
    
        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
        }
        int len = sb.length();
        sb.delete(len - GLUE.length(), len);
    
        return "{ " + sb.toString() + " }";
    }
    

    Improvements: the guessEncoding method reads the input stream entirely. For large input streams this can be a concern, since all three libraries would then examine the whole input, making detection slow.

    It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.
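
    For example, a sketch of that optimization (the 16 KB sample size is an arbitrary assumption) replaces the full read in guessEncoding with a bounded one:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;

    // Read at most maxBytes from the stream and run detection on that sample only.
    static byte[] readSample(InputStream input, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0, n;
        while (total < maxBytes
                && (n = input.read(buffer, total, maxBytes - total)) != -1) {
            total += n;
        }
        return Arrays.copyOf(buffer, total);
    }

    // Usage: byte[] data = readSample(input, 16 * 1024);

    Note that cutting the sample in the middle of a multi-byte sequence can make the tail look malformed, so detectors that tolerate truncation work best on such samples.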
