In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because they were created in a different/unknown codepage. Is there a way to detect the codepage of a text file?
I've done something similar in Python. Basically, you need lots of sample data from various encodings, broken down by a sliding two-byte window and stored in a dictionary (hash) keyed on byte pairs, with values that are lists of the encodings in which each pair was observed.
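A minimal sketch of what that training step could look like (the `build_table` name and the `(raw_bytes, encoding_name)` input shape are my assumptions, not from the original answer):

```python
from collections import defaultdict

def build_table(samples):
    """Build the byte-pair -> encodings table from labelled sample data.

    `samples` is an iterable of (raw_bytes, encoding_name) pairs,
    e.g. subtitle files whose encoding is already known.
    """
    table = defaultdict(set)  # a set stands in for the answer's "list of encodings"
    for data, encoding in samples:
        # Slide a two-byte window over the raw bytes and record that
        # this byte pair occurs in text of this encoding.
        for i in range(len(data) - 1):
            table[data[i:i + 2]].add(encoding)
    return table
```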
Given that dictionary (hash), you take your input text and:

1. If it starts with a BOM (`'\xfe\xff'` for UTF-16-BE, `'\xff\xfe'` for UTF-16-LE, `'\xef\xbb\xbf'` for UTF-8, etc.), treat it as that encoding.
2. If not, take a large enough sample of the text, look up every byte pair of the sample in the dictionary, and choose the encoding it suggests most often.

If you've also sampled UTF-encoded texts that do not start with any BOM, the second step will cover those that slipped through the first step.
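A hedged sketch of the lookup side under the same assumptions; `guess_encoding` and `sample_size` are hypothetical names, while the `codecs.BOM_*` constants come from the standard library:

```python
import codecs
from collections import Counter

# Check the longer BOMs first: the UTF-32-LE BOM starts with the
# UTF-16-LE BOM, so order matters here.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def guess_encoding(data, table, sample_size=4096):
    # Step 1: trust an explicit BOM if one is present.
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    # Step 2: vote over the byte pairs of a large enough sample.
    votes = Counter()
    sample = data[:sample_size]
    for i in range(len(sample) - 1):
        for encoding in table.get(sample[i:i + 2], ()):
            votes[encoding] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Usage would then be something like `table = build_table(samples)` once, followed by `guess_encoding(raw, table)` per file.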
So far it has worked for me (the sample data and subsequent input data are subtitles in various languages), with steadily diminishing error rates.