How can I detect the encoding/codepage of a text file?

梦如初夏 2020-11-21 22:42

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the

20 answers
  •  太阳男子
    2020-11-21 23:09

    Ten years (!) have passed since this was asked, and I still see no mention of Microsoft's good, non-GPL solution: the IMultiLanguage2 API.

    Most of the libraries already mentioned are based on Mozilla's UDE, and it seems reasonable that browsers have already tackled similar problems. I don't know what Chrome's solution is, but Microsoft has shipped its own since IE 5.0, and it is:

    1. Free of GPL-and-the-like licensing issues,
    2. Backed and maintained probably forever,
    3. Rich in its output: all valid candidate encodings/codepages along with confidence scores,
    4. Surprisingly easy to use (it is a single function call).

    It is a native COM call, but here's some very nice work by Carsten Zeumer that handles the interop mess for .NET usage. There are some others around, but by and large this library doesn't get the attention it deserves.
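
    To make the "list of candidates with scores" idea concrete, here is a minimal, portable sketch in Python. It is not MLang and does not reproduce what IMultiLanguage2 does internally; it swaps in a naive trial-decode heuristic, where a candidate is kept only if the bytes decode without error and is scored by the fraction of printable characters. The names `rank_encodings` and `CANDIDATES`, as well as the candidate list itself, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical candidate list; replace with the encodings your sources really use.
CANDIDATES = ["utf-8", "cp1252", "latin-1"]


def rank_encodings(
    data: bytes, candidates: Sequence[str] = CANDIDATES
) -> List[Tuple[str, float]]:
    """Return (encoding, score) pairs for each candidate that decodes cleanly.

    The score is a crude stand-in for a confidence value: the fraction of
    decoded characters that are printable or common whitespace.
    """
    results = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding; drop it
        good = sum(ch.isprintable() or ch in "\r\n\t" for ch in text)
        score = good / len(text) if text else 1.0
        results.append((name, score))
    # Best score first; ties keep the order of the candidate list.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


if __name__ == "__main__":
    sample = "Привет".encode("utf-8")
    for encoding, score in rank_encodings(sample):
        print(f"{encoding}: {score:.2f}")
```

    A heuristic this simple cannot tell single-byte codepages apart reliably, since almost any byte sequence decodes in them. That gap is what a statistical detector such as the one behind IMultiLanguage2 is meant to close.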
