In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the files were created in a different/unknown codepage.
Notepad++ has this feature out-of-the-box: it detects the encoding when you open a file, and it also supports converting to a different encoding.
If you can link to a C library, you can use libenca. See http://cihar.com/software/enca/. From the man page:
Enca reads given text files, or standard input when none are given, and uses knowledge about their language (must be supported by you) and a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings.
It's GPL v2.
Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.
Most people (or applications) do stuff in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary, it'll always be in Windows-1252 or whatever his machine defaults to.
Where possible, a bit of customer training never hurts either :-)
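In C#, that hint can be as simple as a per-source cache; a minimal sketch (the source key and the Encoding.Default fallback are placeholders, not a recommendation):

    using System.Collections.Generic;
    using System.Text;

    static class EncodingHints
    {
        // Remembers the encoding that worked last time for each source (e.g. sender).
        static readonly Dictionary<string, Encoding> LastKnown =
            new Dictionary<string, Encoding>();

        public static Encoding FirstGuess(string source)
        {
            Encoding cached;
            // Try what this source used before; otherwise fall back to a default guess.
            return LastKnown.TryGetValue(source, out cached) ? cached : Encoding.Default;
        }

        public static void Remember(string source, Encoding encoding)
        {
            LastKnown[source] = encoding; // update the hint after a successful read
        }
    }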
I've done something similar in Python. Basically, you need lots of sample data from various encodings, which are broken down by a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs providing values of lists of encodings.
Given that dictionary (hash), you take your input text and:

1. If it starts with any BOM ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8, and so on), treat the encoding as the BOM suggests.
2. If not, take a large enough sample of the text, look up every byte pair of the sample in the dictionary, and pick the encoding suggested most often.

If you've also sampled UTF-encoded texts that do not start with any BOM, the second step will cover those that slipped through the first step.
So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.
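For illustration, here is a minimal C# sketch of the byte-pair voting in the second step (the original was Python; the training samples and encoding names are whatever you feed it):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class BytePairDetector
    {
        // Maps a two-byte window to the encodings whose sample data contained it.
        static readonly Dictionary<ushort, List<string>> PairToEncodings =
            new Dictionary<ushort, List<string>>();

        // Call once per sample file of known encoding to build the dictionary.
        public static void Train(byte[] sample, string encodingName)
        {
            for (int i = 0; i + 1 < sample.Length; i++)
            {
                ushort pair = (ushort)((sample[i] << 8) | sample[i + 1]);
                List<string> list;
                if (!PairToEncodings.TryGetValue(pair, out list))
                    PairToEncodings[pair] = list = new List<string>();
                list.Add(encodingName);
            }
        }

        // Slide the same window over the input and vote; the encoding
        // suggested most often wins. Returns null if nothing matched.
        public static string Detect(byte[] input)
        {
            var votes = new Dictionary<string, int>();
            for (int i = 0; i + 1 < input.Length; i++)
            {
                ushort pair = (ushort)((input[i] << 8) | input[i + 1]);
                List<string> candidates;
                if (!PairToEncodings.TryGetValue(pair, out candidates))
                    continue;
                foreach (var enc in candidates)
                    votes[enc] = votes.TryGetValue(enc, out var n) ? n + 1 : 1;
            }
            return votes.Count == 0
                ? null
                : votes.OrderByDescending(kv => kv.Value).First().Key;
        }
    }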
The tool "uchardet" does this well using character frequency distribution models for each charset. Larger files and more "typical" files have more confidence (obviously).
On Ubuntu, you can just apt-get install uchardet.
On other systems, get the source, usage & docs here: https://github.com/BYVoid/uchardet
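If you need the result from code rather than the shell, one option is to shell out to the binary; a rough C# sketch, assuming uchardet is on the PATH:

    using System.Diagnostics;

    static class Uchardet
    {
        // Runs "uchardet <path>" and returns the charset name it prints
        // on stdout, e.g. "UTF-8" or "WINDOWS-1252".
        public static string Detect(string path)
        {
            var psi = new ProcessStartInfo("uchardet", "\"" + path + "\"")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using (var p = Process.Start(psi))
            {
                string output = p.StandardOutput.ReadToEnd().Trim();
                p.WaitForExit();
                return output;
            }
        }
    }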
I was actually looking for a generic, non-programming way of detecting the file encoding, but I haven't found one yet. What I did find by testing with different encodings was that my text was UTF-7.
So where I first was doing:

    StreamReader file = File.OpenText(fullfilename);

I had to change it to:

    StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);
OpenText assumes it's UTF-8.
You can also create the StreamReader like this: new StreamReader(fullfilename, true). The second parameter tells it to try to detect the encoding from the byte order mark of the file, but that didn't work in my case.
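Putting the two together, a small sketch that tries BOM detection first and falls back to an explicit encoding; UTF-7 here only because that's what my files turned out to be, and the replacement-character check is just an assumption for spotting a bad guess:

    using System.IO;
    using System.Text;

    static class FallbackReader
    {
        public static string Read(string fullfilename)
        {
            // First attempt: let StreamReader look at the byte order mark.
            using (var reader = new StreamReader(fullfilename, true))
            {
                string text = reader.ReadToEnd();
                // Assumption: undecodable bytes show up as U+FFFD replacement
                // characters, so their presence means the guess was wrong.
                if (!text.Contains("\uFFFD"))
                    return text;
            }

            // Fallback: force the encoding this source is known to use.
            using (var reader = new StreamReader(fullfilename, Encoding.UTF7))
            {
                return reader.ReadToEnd();
            }
        }
    }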