In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When read, these files sometimes contain garbage because they were created in a different or unknown character encoding.
The tool "uchardet" detects the encoding well, using character frequency distribution models for each charset. Detection is more reliable for larger files and for more "typical" content (obviously).
On Ubuntu, you can just run apt-get install uchardet.
On other systems, get the source, usage & docs here: https://github.com/BYVoid/uchardet
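
Once uchardet is installed, one way to use it from an application is to call the command-line tool and decode the file with whatever charset it reports. The following is only a minimal sketch in Python, not an official binding: the helper names detect_charset and read_text are made up for illustration, it assumes the uchardet executable is on the PATH and prints a single charset name for the file it is given, and it assumes a UTF-8 fallback is acceptable when detection fails.

```python
import shutil
import subprocess


def detect_charset(path, cmd="uchardet"):
    """Return the charset name printed by the uchardet CLI, or None.

    Hypothetical helper: assumes `uchardet <file>` prints one charset
    name on stdout, and something containing "unknown" when it cannot decide.
    """
    if shutil.which(cmd) is None:
        return None  # tool is not installed or not on the PATH
    result = subprocess.run([cmd, path], capture_output=True, text=True, check=False)
    name = result.stdout.strip()
    if result.returncode != 0 or not name or "unknown" in name.lower():
        return None
    return name


def read_text(path, fallback="utf-8", cmd="uchardet"):
    """Read a text file using the detected charset, else the fallback."""
    charset = detect_charset(path, cmd=cmd) or fallback
    try:
        with open(path, encoding=charset, errors="replace") as f:
            return f.read()
    except LookupError:
        # Python has no codec under the name uchardet reported
        with open(path, encoding=fallback, errors="replace") as f:
            return f.read()
```

Calling read_text("incoming.csv") (a hypothetical file name) would then return the decoded contents. Because the detection is statistical, it is probably wise to treat the result as a best guess, especially for very short files.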