Java Text File Encoding

逝去的感伤 2020-12-17 18:54

I have a text file and it can be ANSI (with ISO-8859-2 charset), UTF-8, UCS-2 Big or Little Endian.

Is there any way to detect the encoding of the file so that it can be read properly?

4 Answers
  • 2020-12-17 19:01

    If your text file is a properly created Unicode text file, then the byte order mark (BOM) should tell you all the information you need. See here for more details about the BOM.

    If it's not then you'll have to use some encoding detection library.
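    The BOM check above can be sketched directly in Java (a minimal sketch; `detectBom` is a hypothetical helper that only covers the common UTF-8/UTF-16 signatures):

    ```java
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class BomSniffer {
        // Returns the charset implied by a leading BOM, or null if no BOM is found.
        static Charset detectBom(byte[] b) {
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
                return StandardCharsets.UTF_8;     // EF BB BF
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                return StandardCharsets.UTF_16BE;  // FE FF
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
                return StandardCharsets.UTF_16LE;  // FF FE
            return null; // no BOM: could still be UTF-8 without BOM, ISO-8859-2, ...
        }

        public static void main(String[] args) {
            byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
            System.out.println(BomSniffer.detectBom(withBom));              // prints UTF-8
            System.out.println(BomSniffer.detectBom(new byte[]{'h', 'i'})); // prints null
        }
    }
    ```

    Note that a `null` result does not decide anything: a UTF-8 file is not required to carry a BOM, which is why the detection libraries below are still needed.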

  • 2020-12-17 19:10

    You can use ICU4J (http://icu-project.org/apiref/icu4j/)

    Here is my code:

        import com.ibm.icu.text.CharsetDetector;
        import com.ibm.icu.text.CharsetMatch;
        import java.io.IOException;
        import java.nio.file.Files;

        String charset = "ISO-8859-1"; // default charset, put whatever you want

        // Read the whole file into a byte array. Files.readAllBytes opens,
        // fully reads and closes the file, so there is no stream to manage
        // and no risk of a partial read.
        byte[] data = Files.readAllBytes(file.toPath());

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        CharsetMatch cm = detector.detect();

        if (cm != null) {
            int confidence = cm.getConfidence();
            System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
            // Here you have the encoding name and the confidence.
            // In my case, if the confidence is > 50 I use the detected
            // encoding; otherwise I keep the default value.
            if (confidence > 50) {
                charset = cm.getName();
            }
        }

    Remember to add the necessary try/catch blocks (reading the file can throw an IOException).

    I hope this works for you.

  • 2020-12-17 19:13

    Yes, there are a number of libraries for character-encoding detection, specifically in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There are also cpdetector and a project by IBM called ICU4j. I'd take a look at the latter, as it seems to be more reliable than the other two. They work by statistical analysis of the file's bytes, and ICU4j will also provide a confidence level for the character encoding it detects, so you can use that in the case above. It works pretty well.

  • 2020-12-17 19:16

    UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.

    I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.

    There's no such thing as reading a file "as it is" and getting text out: a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
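    That last point can be shown in two lines: the very same byte decodes to different characters depending on the charset you apply (a minimal sketch; `DecodeDemo` is a made-up class name):

    ```java
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class DecodeDemo {
        public static void main(String[] args) {
            // Byte 0xA3 has no intrinsic meaning; the charset gives it one.
            byte[] bytes = {(byte) 0xA3};
            System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));   // "£" (pound sign)
            System.out.println(new String(bytes, Charset.forName("ISO-8859-2"))); // "Ł" (L with stroke)
        }
    }
    ```

    This is exactly why guessing the wrong encoding silently corrupts the text instead of failing loudly.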
