In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because they were created in a different/unknown codepage. Is there a way to detect the codepage of a text file?
You can't detect the codepage; you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.
Anyway, this is what you need to read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Specifically, Joel says:
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text.
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
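That said, there is one honest exception to the guessing problem: a byte order mark, when present, really does identify the encoding. Here is a minimal BOM-sniffing sketch in C# (the BomSniffer class and DetectBom helper are my own illustration, not a library API):

using System.IO;
using System.Text;

static class BomSniffer
{
    // Returns the encoding declared by a BOM, or null when there is no
    // BOM -- i.e. when you genuinely have to be told (or guess) the codepage.
    public static Encoding DetectBom(string path)
    {
        var bom = new byte[4];
        using (var fs = File.OpenRead(path))
            fs.Read(bom, 0, 4);

        if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
            return Encoding.UTF8;
        if (bom[0] == 0xFF && bom[1] == 0xFE && bom[2] == 0x00 && bom[3] == 0x00)
            return Encoding.UTF32;                // UTF-32, little-endian
        if (bom[0] == 0xFF && bom[1] == 0xFE)
            return Encoding.Unicode;              // UTF-16, little-endian
        if (bom[0] == 0xFE && bom[1] == 0xFF)
            return Encoding.BigEndianUnicode;     // UTF-16, big-endian
        if (bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF)
            return new UTF32Encoding(true, true); // UTF-32, big-endian
        return null;                              // no BOM: encoding unknown
    }
}

Most files in the wild carry no BOM at all, which is why the detectors mentioned below can only ever report a guess with a confidence level.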
Have you tried the C# port of Mozilla Universal Charset Detector?
Example from http://code.google.com/p/ude/
using System;
using System.IO;

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        // Feed the stream to the detector, then signal end of input.
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}",
                cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}
I had the same problem but haven't found a good solution yet for detecting it automatically. For now I'm using PsPad (www.pspad.com) for that ;) It works fine.
As an addition to ITmeze's post, I've used this function to convert the output of the C# port of Mozilla Universal Charset Detector into a System.Text.Encoding:
private Encoding GetEncodingFromString(string codePageName)
{
    try
    {
        // Ude reports charset names ("UTF-8", "windows-1252", ...)
        // that Encoding.GetEncoding understands directly.
        return Encoding.GetEncoding(codePageName);
    }
    catch (ArgumentException)
    {
        // Unknown or unsupported charset name: fall back to ASCII.
        return Encoding.ASCII;
    }
}
See Encoding.GetEncoding on MSDN.
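For completeness, here is how this fits together with the Ude example above; this glue code is my own sketch (filename is the same illustrative variable), not part of either library:

Encoding enc;
using (FileStream fs = File.OpenRead(filename))
{
    var cdet = new Ude.CharsetDetector();
    cdet.Feed(fs);
    cdet.DataEnd();
    // Map the detected charset name to an Encoding,
    // falling back to ASCII when detection fails.
    enc = cdet.Charset != null
        ? GetEncodingFromString(cdet.Charset)
        : Encoding.ASCII;
}
string text = File.ReadAllText(filename, enc);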
I know it's very late for this question and this solution won't appeal to some (because of its English-centric bias and its lack of statistical/empirical testing), but it's worked very well for me, especially for processing uploaded CSV data:
http://www.architectshack.com/TextFileEncodingDetector.ashx
Note: I'm the one who wrote this class, so obviously take it with a grain of salt! :)
Thanks @Erik Aronesty for mentioning uchardet.

Meanwhile the (same?) tool exists for Linux: chardet. Or, on Cygwin, you may want to use chardetect.

See the chardetect man page: https://www.commandlinux.com/man-page/man1/chardetect.1.html

It will heuristically detect (guess) the character encoding for each given file and report the name and confidence level for each file's detected character encoding.
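For example (the file names here are illustrative; the tool prints one "name with confidence" line per file):

$ chardetect legacy-export.csv notes.txt
legacy-export.csv: windows-1252 with confidence 0.73
notes.txt: utf-8 with confidence 0.99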