How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

前端 未结 4 1136
野趣味
野趣味 2020-11-30 00:15

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).

When the BOM (Byte Order Mark) is there, I have n

相关标签:
4条回答
  • 2020-11-30 00:36

    Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection that used by Firefox, and is used by many different applications. Useful links: Mozilla's code, research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), short explanation, detailed explanation.

    0 讨论(0)
  • 2020-11-30 00:41

    My guess is:

    • First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8. Thus - UTF-16. Just have to figure out the endianness. For this you should probably use some table of valid Unicode character codes. If you encounter invalid codes, try the other endianness if that fits. If either fit (or don't), check which one has larger percentage of alphanumeric codes. Also you might try searchung for line breaks and determine endianness from them. Other than that, I have no ideas how to check for endianness.
    • If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
    • If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. Best thing you can do is something like Internet Explorer which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.
    0 讨论(0)
  • 2020-11-30 00:41

    ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.

    The only test I know of is to check for invalid UTF-8 chars. If you find any, then you know it can't be UTF-8. Same is probably possible for UTF-16. But when it's no Unicode set, then it'll be hard to tell which Windows code page it might be.

    Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.

    There is code out there for checking validity of UTF chars.

    0 讨论(0)
  • 2020-11-30 00:43

    Here is how notepad does that

    There is also the python Universal Encoding Detector which you can check.

    0 讨论(0)
提交回复
热议问题