With reference to the following thread: Java App : Unable to read iso-8859-1 encoded file correctly
What is the best way to programatically determine the correct cha
As far as I know, there is no general library in this context to be suitable for all types of problems. So, for each problem you should test the existing libraries and select the best one which satisfies your problem’s constraints, but often none of them is appropriate. In these cases you can write your own Encoding Detector! As I have wrote ...
I’ve wrote a meta java tool for detecting charset encoding of HTML Web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. Here you can find my tool, please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.
Bellow I provided some helpful comments which I’ve experienced in my work:
check this out: http://site.icu-project.org/ (icu4j) they have libraries for detecting charset from IOStream could be simple like this:
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();
if (cm != null) {
reader = cm.getReader();
charset = cm.getName();
}else {
throw new UnsupportedCharsetException()
}
In plain Java:
final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };
List<String> lines;
for (String encoding : encodings) {
try {
lines = Files.readAllLines(path, Charset.forName(encoding));
for (String line : lines) {
// do something...
}
break;
} catch (IOException ioe) {
System.out.println(encoding + " failed, trying next.");
}
}
This approach will try the encodings one by one until one works or we run out of them. (BTW my encodings list has only those items because they are the charsets implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)