How can I detect the encoding/codepage of a text file

Backend · Unresolved · 20 answers · 1412 views
梦如初夏 2020-11-21 22:42

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because they were created in a different/unknown codepage.

20 Answers
  •  悲&欢浪女
    2020-11-21 23:27

    I've done something similar in Python. Basically, you need lots of sample data in various encodings; each sample is broken down with a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs, with values of lists of encodings.
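
    Building that dictionary might be sketched as below, assuming (hypothetically) that the training data arrives as a mapping from encoding name to raw bytes in that encoding:

    ```python
    from collections import defaultdict

    def train(samples):
        """Build a byte-pair -> set-of-encodings table from labelled samples.

        `samples` maps an encoding name to raw bytes known to be in that
        encoding (a hypothetical input format for this sketch).
        """
        table = defaultdict(set)
        for encoding, data in samples.items():
            # Slide a two-byte window over the raw bytes; each pair
            # records the encodings in which it has been observed.
            for i in range(len(data) - 1):
                table[data[i:i + 2]].add(encoding)
        return table
    ```

    Sets rather than lists are used here for the values so that repeated occurrences of a pair in one sample are counted only once per encoding.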

    Given that dictionary (hash), you take your input text and:

    • if it starts with any BOM ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8, etc.), treat it as the encoding the BOM suggests
    • if not, then take a large enough sample of the text, collect all byte-pairs of the sample, and choose the encoding that is the least commonly suggested one from the dictionary.

    If you've also sampled UTF-encoded texts that do not start with any BOM, the second step will cover those that slipped through the first step.

    So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.
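
    The two detection steps above might look roughly like this. Note two assumptions: the BOM list, its ordering (UTF-32 BOMs checked before UTF-16, since '\xff\xfe' is a prefix of the UTF-32-LE BOM), and the 4096-byte sample size are choices made for this sketch; and where the answer's "least common suggested" phrasing is ambiguous, this sketch substitutes a plain majority vote over the trained table:

    ```python
    import codecs
    from collections import Counter

    # BOM prefixes, longest first so UTF-32 is matched before UTF-16.
    BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def guess_encoding(data, table, sample_size=4096):
        """Guess the encoding of `data` (bytes).

        `table` maps two-byte pairs to collections of encoding names,
        as produced by training on labelled samples.
        """
        # Step 1: if the data starts with a BOM, trust it.
        for bom, name in BOMS:
            if data.startswith(bom):
                return name
        # Step 2: vote over the byte-pairs of a sample of the text.
        votes = Counter()
        sample = data[:sample_size]
        for i in range(len(sample) - 1):
            for enc in table.get(sample[i:i + 2], ()):
                votes[enc] += 1
        return votes.most_common(1)[0][0] if votes else None
    ```

    Returning None when no pair is recognised leaves the caller to fall back to a default codepage.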
