Java Text File Encoding

逝去的感伤 2020-12-17 18:54

I have a text file and it can be ANSI (with ISO-8859-2 charset), UTF-8, UCS-2 Big or Little Endian.

Is there any way to detect the encoding of the file so that it can be read properly?

4 Answers
  • 2020-12-17 19:01

    If your text file is a properly created Unicode text file, then the byte order mark (BOM) should tell you all the information you need. See here for more details about the BOM.

    If it's not then you'll have to use some encoding detection library.
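    The BOM check above can be sketched directly in Java (a minimal sketch; `detectBom` is a hypothetical helper that only covers the common UTF-8/UTF-16 signatures):

    ```java
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class BomSniffer {
        // Returns the charset implied by a leading BOM, or null if no BOM is found.
        static Charset detectBom(byte[] b) {
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
                return StandardCharsets.UTF_8;     // EF BB BF
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                return StandardCharsets.UTF_16BE;  // FE FF
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
                return StandardCharsets.UTF_16LE;  // FF FE
            return null; // no BOM: could still be UTF-8 without BOM, ISO-8859-2, ...
        }

        public static void main(String[] args) {
            byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
            System.out.println(BomSniffer.detectBom(withBom));              // prints UTF-8
            System.out.println(BomSniffer.detectBom(new byte[]{'h', 'i'})); // prints null
        }
    }
    ```

    Note that a `null` result does not decide anything: a UTF-8 file is not required to carry a BOM, which is why the detection libraries below are still needed.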

  • 2020-12-17 19:10

    You can use ICU4J (http://icu-project.org/apiref/icu4j/)

    Here is my code:

        import com.ibm.icu.text.CharsetDetector;
        import com.ibm.icu.text.CharsetMatch;
        import java.io.IOException;
        import java.nio.file.Files;

        String charset = "ISO-8859-1"; // default charset, put whatever you want

        // Read the whole file into a byte array. Files.readAllBytes opens,
        // fully reads and closes the file, so there is no stream to manage
        // and no risk of a partial read.
        byte[] data = Files.readAllBytes(file.toPath());

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        CharsetMatch cm = detector.detect();

        if (cm != null) {
            int confidence = cm.getConfidence();
            System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
            // Here you have the encoding name and the confidence.
            // In my case, if the confidence is > 50 I use the detected
            // encoding; otherwise I keep the default value.
            if (confidence > 50) {
                charset = cm.getName();
            }
        }

    Remember to add the necessary try/catch blocks (reading the file can throw an IOException).

    I hope this works for you.

  • 2020-12-17 19:13

    Yes, there are a number of libraries for character-encoding detection, specifically in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There are also cpdetector and a project by IBM called ICU4j. I'd take a look at the latter, as it seems to be more reliable than the other two. They work by statistical analysis of the file's bytes, and ICU4j will also provide a confidence level for the character encoding it detects, so you can use that in the case above. It works pretty well.

  • 2020-12-17 19:16

    UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.

    I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.

    There's no such thing as reading a file "as it is" and getting text out: a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
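    That last point can be shown in two lines: the very same byte decodes to different characters depending on the charset you apply (a minimal sketch; `DecodeDemo` is a made-up class name):

    ```java
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class DecodeDemo {
        public static void main(String[] args) {
            // Byte 0xA3 has no intrinsic meaning; the charset gives it one.
            byte[] bytes = {(byte) 0xA3};
            System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));   // "£" (pound sign)
            System.out.println(new String(bytes, Charset.forName("ISO-8859-2"))); // "Ł" (L with stroke)
        }
    }
    ```

    This is exactly why guessing the wrong encoding silently corrupts the text instead of failing loudly.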
