Java : How to determine the correct charset encoding of a stream

花落未央 2020-11-22 02:06

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct charset encoding of the stream?

15 Answers
  •  孤街浪徒
    2020-11-22 02:28

    Which library to use?

    As of this writing, there are three libraries that emerge:

    • GuessEncoding
    • ICU4j
    • juniversalchardet

    I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.

    How to tell which one has detected the right charset (or as close as possible)?

    It's impossible to certify the charset detected by each of the above libraries. However, it's possible to ask them in turn and score the returned responses.

    How to score the returned response?

    Each response can be assigned one point. The more points a response has, the more confidence the detected charset has. This is a simple scoring method; you can devise more elaborate ones.

    Is there any sample code?

    Here is a full snippet implementing the strategy described above.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;
    
    import com.glaforge.i18n.io.CharsetToolkit;              // GuessEncoding
    import com.ibm.icu.text.CharsetDetector;                  // ICU4j
    import com.ibm.icu.text.CharsetMatch;                     // ICU4j
    import org.mozilla.universalchardet.UniversalDetector;    // juniversalchardet
    
    public static String guessEncoding(InputStream input) throws IOException {
        // Load input data
        long count = 0;
        int n = 0, EOF = -1;
        byte[] buffer = new byte[4096];
        ByteArrayOutputStream output = new ByteArrayOutputStream();
    
        while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
            output.write(buffer, 0, n);
            count += n;
        }
    
        if (count > Integer.MAX_VALUE) {
            throw new RuntimeException("Inputstream too large.");
        }
    
        byte[] data = output.toByteArray();
    
        // Detect encoding: each detected charset gets one point per library
        Map<String, int[]> encodingsScores = new HashMap<>();
    
        // * GuessEncoding
        updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());
    
        // * ICU4j
        CharsetDetector charsetDetector = new CharsetDetector();
        charsetDetector.setText(data);
        charsetDetector.enableInputFilter(true);
        CharsetMatch cm = charsetDetector.detect();
        if (cm != null) {
            updateEncodingsScores(encodingsScores, cm.getName());
        }
    
        // * juniversalchardet
        UniversalDetector universalDetector = new UniversalDetector(null);
        universalDetector.handleData(data, 0, data.length);
        universalDetector.dataEnd();
        String encodingName = universalDetector.getDetectedCharset();
        if (encodingName != null) {
            updateEncodingsScores(encodingsScores, encodingName);
        }
    
        // Find winning encoding (highest score)
        Map.Entry<String, int[]> maxEntry = null;
        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
                maxEntry = e;
            }
        }
    
        String winningEncoding = maxEntry.getKey();
        //dumpEncodingsScores(encodingsScores);
        return winningEncoding;
    }
    
    private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
        String encodingName = encoding.toLowerCase();
        int[] encodingScore = encodingsScores.get(encodingName);
    
        if (encodingScore == null) {
            encodingsScores.put(encodingName, new int[] { 1 });
        } else {
            encodingScore[0]++;
        }
    }
    
    private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
        System.out.println(toString(encodingsScores));
    }
    
    private static String toString(Map<String, int[]> encodingsScores) {
        String GLUE = ", ";
        StringBuilder sb = new StringBuilder();
    
        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            sb.append(e.getKey()).append(":").append(e.getValue()[0]).append(GLUE);
        }
        int len = sb.length();
        sb.delete(len - GLUE.length(), len);
    
        return "{ " + sb.toString() + " }";
    }
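
    The method can be called on any InputStream. A minimal usage sketch (the file name is a placeholder, and java.io.FileInputStream must also be imported):

    try (InputStream in = new FileInputStream("some-file.txt")) {  // placeholder file name
        String charset = guessEncoding(in);
        System.out.println("Detected charset: " + charset);
    }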
    

    Improvements: The guessEncoding method reads the input stream entirely, which can be a concern for large streams. All three libraries would then process the whole byte array, which implies a large time cost for detecting the charset.

    It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.
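
    A minimal sketch of that idea, assuming a fixed maxBytes cap (for example 16 * 1024) is enough for detection; detectAndScore is a hypothetical helper wrapping the three detectors and the scoring shown above:

    public static String guessEncoding(InputStream input, int maxBytes) throws IOException {
        // Read at most maxBytes from the stream instead of consuming it entirely
        byte[] buffer = new byte[maxBytes];
        int read = 0, n;
        while (read < maxBytes && (n = input.read(buffer, read, maxBytes - read)) != -1) {
            read += n;
        }
        byte[] data = java.util.Arrays.copyOf(buffer, read);
        // detectAndScore (hypothetical) would run the three detectors on 'data'
        // and return the highest-scoring charset, exactly as in the snippet above
        return detectAndScore(data);
    }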
