In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the files were created in a different/unknown codepage.
Notepad++ has this feature out-of-the-box: it detects the encoding when you open a file, and it also supports converting to a different encoding.
If you can link to a C library, you can use libenca. See http://cihar.com/software/enca/. From the man page:
Enca reads given text files, or standard input when none are given, and uses knowledge about their language (must be supported by you) and a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings.
It's GPL v2.
Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.
Most people (or applications) do stuff in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary, it'll always be in Windows-1252 or whatever his machine defaults to.
Where possible, a bit of customer training never hurts either :-)
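In C#, that hint can be as simple as a per-source cache; a minimal sketch (the source key and the Encoding.Default fallback are placeholders, not a recommendation):

    using System.Collections.Generic;
    using System.Text;

    static class EncodingHints
    {
        // Remembers the encoding that worked last time for each source (e.g. sender).
        static readonly Dictionary<string, Encoding> LastKnown =
            new Dictionary<string, Encoding>();

        public static Encoding FirstGuess(string source)
        {
            Encoding cached;
            // Try what this source used before; otherwise fall back to a default guess.
            return LastKnown.TryGetValue(source, out cached) ? cached : Encoding.Default;
        }

        public static void Remember(string source, Encoding encoding)
        {
            LastKnown[source] = encoding; // update the hint after a successful read
        }
    }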
I've done something similar in Python. Basically, you need lots of sample data from various encodings, which are broken down by a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs providing values of lists of encodings.
Given that dictionary (hash), you take your input text and:

1. If it starts with any BOM ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8, and so on), treat the encoding as the BOM suggests.
2. If not, take a large enough sample of the text, look up every byte pair of the sample in the dictionary, and pick the encoding suggested most often.

If you've also sampled UTF-encoded texts that do not start with any BOM, the second step will cover those that slipped through the first step.
So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.
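For illustration, here is a minimal C# sketch of the byte-pair voting in the second step (the original was Python; the training samples and encoding names are whatever you feed it):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class BytePairDetector
    {
        // Maps a two-byte window to the encodings whose sample data contained it.
        static readonly Dictionary<ushort, List<string>> PairToEncodings =
            new Dictionary<ushort, List<string>>();

        // Call once per sample file of known encoding to build the dictionary.
        public static void Train(byte[] sample, string encodingName)
        {
            for (int i = 0; i + 1 < sample.Length; i++)
            {
                ushort pair = (ushort)((sample[i] << 8) | sample[i + 1]);
                List<string> list;
                if (!PairToEncodings.TryGetValue(pair, out list))
                    PairToEncodings[pair] = list = new List<string>();
                list.Add(encodingName);
            }
        }

        // Slide the same window over the input and vote; the encoding
        // suggested most often wins. Returns null if nothing matched.
        public static string Detect(byte[] input)
        {
            var votes = new Dictionary<string, int>();
            for (int i = 0; i + 1 < input.Length; i++)
            {
                ushort pair = (ushort)((input[i] << 8) | input[i + 1]);
                List<string> candidates;
                if (!PairToEncodings.TryGetValue(pair, out candidates))
                    continue;
                foreach (var enc in candidates)
                    votes[enc] = votes.TryGetValue(enc, out var n) ? n + 1 : 1;
            }
            return votes.Count == 0
                ? null
                : votes.OrderByDescending(kv => kv.Value).First().Key;
        }
    }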
The tool "uchardet" does this well using character frequency distribution models for each charset. Larger files and more "typical" files have more confidence (obviously).
On Ubuntu, you can just apt-get install uchardet.
On other systems, get the source, usage & docs here: https://github.com/BYVoid/uchardet
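If you need the result from code rather than the shell, one option is to shell out to the binary; a rough C# sketch, assuming uchardet is on the PATH:

    using System.Diagnostics;

    static class Uchardet
    {
        // Runs "uchardet <path>" and returns the charset name it prints
        // on stdout, e.g. "UTF-8" or "WINDOWS-1252".
        public static string Detect(string path)
        {
            var psi = new ProcessStartInfo("uchardet", "\"" + path + "\"")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using (var p = Process.Start(psi))
            {
                string output = p.StandardOutput.ReadToEnd().Trim();
                p.WaitForExit();
                return output;
            }
        }
    }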
I was actually looking for a generic, non-programming way of detecting the file encoding, but I haven't found one yet. What I did find by testing with different encodings was that my text was UTF-7.
So where I first was doing:

    StreamReader file = File.OpenText(fullfilename);

I had to change it to:

    StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);
OpenText assumes it's UTF-8.
You can also create the StreamReader like this: new StreamReader(fullfilename, true). The second parameter tells it to try to detect the encoding from the byte order mark of the file, but that didn't work in my case.
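Putting the two together, a small sketch that tries BOM detection first and falls back to an explicit encoding; UTF-7 here only because that's what my files turned out to be, and the replacement-character check is just an assumption for spotting a bad guess:

    using System.IO;
    using System.Text;

    static class FallbackReader
    {
        public static string Read(string fullfilename)
        {
            // First attempt: let StreamReader look at the byte order mark.
            using (var reader = new StreamReader(fullfilename, true))
            {
                string text = reader.ReadToEnd();
                // Assumption: undecodable bytes show up as U+FFFD replacement
                // characters, so their presence means the guess was wrong.
                if (!text.Contains("\uFFFD"))
                    return text;
            }

            // Fallback: force the encoding this source is known to use.
            using (var reader = new StreamReader(fullfilename, Encoding.UTF7))
            {
                return reader.ReadToEnd();
            }
        }
    }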