Java : How to determine the correct charset encoding of a stream

花落未央 2020-11-22 02:06

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct cha

15 answers
  • 2020-11-22 02:12

    Here are my favorites:

    TikaEncodingDetector

    Dependency:

    <dependency>
      <groupId>org.apache.any23</groupId>
      <artifactId>apache-any23-encoding</artifactId>
      <version>1.1</version>
    </dependency>
    

    Sample:

    public static Charset guessCharset(InputStream is) throws IOException {
      return Charset.forName(new TikaEncodingDetector().guessEncoding(is));    
    }
    

    GuessEncoding

    Dependency:

    <dependency>
      <groupId>org.codehaus.guessencoding</groupId>
      <artifactId>guessencoding</artifactId>
      <version>1.4</version>
      <type>jar</type>
    </dependency>
    

    Sample:

      public static Charset guessCharset2(File file) throws IOException {
        return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
      }
    
  • 2020-11-22 02:17

    The libs above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/, which scans the text itself.
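To make the term concrete, here is a minimal sketch of what BOM-only detection amounts to, using nothing but the JDK. The class name `BomSniffer` and its methods are hypothetical and are not taken from any library mentioned in this thread. It also shows the limitation described above: data without a BOM yields no answer at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Tika, GuessEncoding or jchardet. */
final class BomSniffer {

    private BomSniffer() {}

    /**
     * Returns the charset implied by a leading byte order mark,
     * or Optional.empty() when the data does not start with a known BOM.
     */
    static Optional<Charset> detect(byte[] head) {
        // UTF-32 is checked before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return Optional.of(Charset.forName("UTF-32BE"));
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return Optional.of(Charset.forName("UTF-32LE"));
        }
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: this approach cannot tell anything
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Passing the first four bytes of a file is enough for this check. An empty result means the content itself has to be analysed, which is what statistical detectors do.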

  • 2020-11-22 02:17

    An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.

    Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();
    
  • 2020-11-22 02:18

    If you use ICU4J (http://icu-project.org/apiref/icu4j/), here is my code:

    // Needs: java.nio.file.Files, com.ibm.icu.text.CharsetDetector, com.ibm.icu.text.CharsetMatch
    String charset = "ISO-8859-1"; // Default charset, put whatever you want
    
    /*
     * Read the whole file into a byte array. Files.readAllBytes reads until
     * end of file and closes the file afterwards, whereas a single
     * FileInputStream.read(byte[]) call may return before the array is full.
     */
    byte[] data = Files.readAllBytes(file.toPath());
    
    CharsetDetector detector = new CharsetDetector();
    detector.setText(data);
    
    CharsetMatch cm = detector.detect();
    
    if (cm != null) {
        int confidence = cm.getConfidence();
        System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
        // Here you have the encoding name and the confidence.
        // In my case, if the confidence is > 50 I return the detected encoding; otherwise I return the default value.
        if (confidence > 50) {
            charset = cm.getName();
        }
    }
    

    Remember to add try-catch blocks where needed.

    I hope this works for you.
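The detector hands back the encoding as a name string, so one further step is turning that name into a `java.nio.charset.Charset` before decoding. The sketch below assumes that a detector may report a name the running JVM does not support, and falls back to a caller-supplied default in that case. `CharsetNames`, `resolve` and `decode` are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper for illustration; not part of ICU4J. */
final class CharsetNames {

    private CharsetNames() {}

    /**
     * Maps a detected encoding name to a Charset, returning the fallback
     * when the name is null, syntactically illegal, or unknown to this JVM.
     */
    static Charset resolve(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    /** Decodes the bytes with the detected charset, or with the fallback. */
    static String decode(byte[] data, String detectedName, Charset fallback) {
        return new String(data, resolve(detectedName, fallback));
    }
}
```

With the snippet above, a call could look like `CharsetNames.decode(data, charset, StandardCharsets.ISO_8859_1)`, where `data` and `charset` are the variables from that snippet.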

  • 2020-11-22 02:20

    If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.
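One way to see why this is hard, sketched with the JDK alone: decode the bytes strictly with each candidate charset and keep the first one that produces no errors. This is a crude heuristic, not how the detection libraries work, and the names `StrictDecodeProbe`, `decodesCleanly` and `firstCleanDecode` are hypothetical. ISO-8859-1 maps every possible byte to a character, so it never fails. A clean decode therefore only rules encodings out and never proves one correct.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; a heuristic, not a real detector. */
final class StrictDecodeProbe {

    private StrictDecodeProbe() {}

    /** True when the bytes decode in the given charset without any error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   // Report errors instead of silently substituting U+FFFD.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Returns the first candidate that decodes the data cleanly.
     * List strict encodings (such as UTF-8) before permissive ones
     * (such as ISO-8859-1), which accept any byte sequence.
     */
    static Optional<Charset> firstCleanDecode(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            if (decodesCleanly(data, candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

For example, with the candidates UTF-8 followed by ISO-8859-1, the single byte `0xE9` is rejected by UTF-8 and accepted by ISO-8859-1. Windows-1252 would accept it as well, which is why libraries add statistical analysis on top of checks like this.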

  • 2020-11-22 02:22

    I have used this library, which is similar to jchardet, for detecting encoding in Java: http://code.google.com/p/juniversalchardet/
