Java : How to determine the correct charset encoding of a stream

前端未结

关注

 15  1679

花落未央

With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly

What is the best way to programatically determine the correct cha

相关标签:

15条回答

耶瑟儿～

2020-11-22 02:12

Here are my favorites:

TikaEncodingDetector

Dependency:

<dependency>
  <groupId>org.apache.any23</groupId>
  <artifactId>apache-any23-encoding</artifactId>
  <version>1.1</version>
</dependency>

Sample:

public static Charset guessCharset(InputStream is) throws IOException {
  return Charset.forName(new TikaEncodingDetector().guessEncoding(is));    
}

GuessEncoding

Dependency:

<dependency>
  <groupId>org.codehaus.guessencoding</groupId>
  <artifactId>guessencoding</artifactId>
  <version>1.4</version>
  <type>jar</type>
</dependency>

Sample:

  public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
  }

0 讨论(0)

花落未央

2020-11-22 02:17

The libs above are simple BOM detectors which of course only work if there is a BOM in the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scans the text

0 讨论(0)
发布评论:

提交评论
- 加载中...
闹比i

2020-11-22 02:17
An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.
```
Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

夕颜

2020-11-22 02:18

If you use ICU4J (http://icu-project.org/apiref/icu4j/)

Here is my code:

String charset = "ISO-8859-1"; //Default chartset, put whatever you want

byte[] fileContent = null;
FileInputStream fin = null;

//create FileInputStream object
fin = new FileInputStream(file.getPath());

/*
 * Create byte array large enough to hold the content of the file.
 * Use File.length to determine size of the file in bytes.
 */
fileContent = new byte[(int) file.length()];

/*
 * To read content of the file in byte array, use
 * int read(byte[] byteArray) method of java FileInputStream class.
 *
 */
fin.read(fileContent);

byte[] data =  fileContent;

CharsetDetector detector = new CharsetDetector();
detector.setText(data);

CharsetMatch cm = detector.detect();

if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    //Here you have the encode name and the confidence
    //In my case if the confidence is > 50 I return the encode, else I return the default value
    if (confidence > 50) {
        charset = cm.getName();
    }
}

Remember to put all the try-catch need it.

I hope this works for you.

0 讨论(0)

野的像风

2020-11-22 02:20

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

0 讨论(0)
发布评论:

提交评论
- 加载中...
长发绾君心

2020-11-22 02:22

I have used this library, similar to jchardet for detecting encoding in Java: http://code.google.com/p/juniversalchardet/

0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 3 下一页