How can I detect the encoding/codepage of a text file

后端未结

关注

 20  1407

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the

相关标签:

20条回答

小鲜肉

2020-11-21 23:04

If someone is looking for a 93.9% solution. This works for me:

public static class StreamExtension
{
    /// <summary>
    /// Convert the content to a string.
    /// </summary>
    /// <param name="stream">The stream.</param>
    /// <returns></returns>
    public static string ReadAsString(this Stream stream)
    {
        var startPosition = stream.Position;
        try
        {
            // 1. Check for a BOM
            // 2. or try with UTF-8. The most (86.3%) used encoding. Visit: http://w3techs.com/technologies/overview/character_encoding/all/
            var streamReader = new StreamReader(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true), detectEncodingFromByteOrderMarks: true);
            return streamReader.ReadToEnd();
        }
        catch (DecoderFallbackException ex)
        {
            stream.Position = startPosition;

            // 3. The second most (6.7%) used encoding is ISO-8859-1. So use Windows-1252 (0.9%, also know as ANSI), which is a superset of ISO-8859-1.
            var streamReader = new StreamReader(stream, Encoding.GetEncoding(1252));
            return streamReader.ReadToEnd();
        }
    }
}

0 讨论(0)

感动是毒

2020-11-21 23:05

You can't detect the codepage

This is clearly false. Every web browser has some kind of universal charset detector to deal with pages which have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it. See some documentation here. Basically, it is a heuristic, but one that works really well.

Given a reasonable amount of text, it is even possible to detect the language.

Here's another one I just found using Google:

0 讨论(0)
发布评论:

提交评论
- 加载中...
谎友^

2020-11-21 23:06

If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (same link, with better formatting via Wayback Machine).

0 讨论(0)
发布评论:

提交评论
- 加载中...

醉酒成梦

2020-11-21 23:06

Looking for different solution, I found that

https://code.google.com/p/ude/

this solution is kinda heavy.

I needed some basic encoding detection, based on 4 first bytes and probably xml charset detection - so I've took some sample source code from internet and added slightly modified version of

http://lists.w3.org/Archives/Public/www-validator/2002Aug/0084.html

written for Java.

    public static Encoding DetectEncoding(byte[] fileContent)
    {
        if (fileContent == null)
            throw new ArgumentNullException();

        if (fileContent.Length < 2)
            return Encoding.ASCII;      // Default fallback

        if (fileContent[0] == 0xff
            && fileContent[1] == 0xfe
            && (fileContent.Length < 4
                || fileContent[2] != 0
                || fileContent[3] != 0
                )
            )
            return Encoding.Unicode;

        if (fileContent[0] == 0xfe
            && fileContent[1] == 0xff
            )
            return Encoding.BigEndianUnicode;

        if (fileContent.Length < 3)
            return null;

        if (fileContent[0] == 0xef && fileContent[1] == 0xbb && fileContent[2] == 0xbf)
            return Encoding.UTF8;

        if (fileContent[0] == 0x2b && fileContent[1] == 0x2f && fileContent[2] == 0x76)
            return Encoding.UTF7;

        if (fileContent.Length < 4)
            return null;

        if (fileContent[0] == 0xff && fileContent[1] == 0xfe && fileContent[2] == 0 && fileContent[3] == 0)
            return Encoding.UTF32;

        if (fileContent[0] == 0 && fileContent[1] == 0 && fileContent[2] == 0xfe && fileContent[3] == 0xff)
            return Encoding.GetEncoding(12001);

        String probe;
        int len = fileContent.Length;

        if( fileContent.Length >= 128 ) len = 128;
        probe = Encoding.ASCII.GetString(fileContent, 0, len);

        MatchCollection mc = Regex.Matches(probe, "^<\\?xml[^<>]*encoding[ \\t\\n\\r]?=[\\t\\n\\r]?['\"]([A-Za-z]([A-Za-z0-9._]|-)*)", RegexOptions.Singleline);
        // Add '[0].Groups[1].Value' to the end to test regex

        if( mc.Count == 1 && mc[0].Groups.Count >= 2 )
        {
            // Typically picks up 'UTF-8' string
            Encoding enc = null;

            try {
                enc = Encoding.GetEncoding( mc[0].Groups[1].Value );
            }catch (Exception ) { }

            if( enc != null )
                return enc;
        }

        return Encoding.ASCII;      // Default fallback
    }

It's enough to read probably first 1024 bytes from file, but I'm loading whole file.

0 讨论(0)

太阳男子

2020-11-21 23:09
10Y (!) had passed since this was asked, and still I see no mention of MS's good, non-GPL'ed solution: IMultiLanguage2 API.

Most libraries already mentioned are based on Mozilla's UDE - and it seems reasonable that browsers have already tackled similar problems. I don't know what is chrome's solution, but since IE 5.0 MS have released theirs, and it is:
1. Free of GPL-and-the-like licensing issues,
2. Backed and maintained probably forever,
3. Gives rich output - all valid candidates for encoding/codepages along with confidence scores,
4. Surprisingly easy to use (it is a single function call).
It is a native COM call, but here's some very nice work by Carsten Zeumer, that handles the interop mess for .net usage. There are some others around, but by and large this library doesn't get the attention it deserves.
0 讨论(0)
发布评论:

提交评论
- 加载中...
感情败类

2020-11-21 23:10
I use this code to detect Unicode and windows default ansi codepage when reading a file. For other codings a check of content is necessary, manually or by programming. This can de used to save the text with the same encoding as when it was opened. (I use VB.NET)
```
'Works for Default and unicode (auto detect)
Dim mystreamreader As New StreamReader(LocalFileName, Encoding.Default) 
MyEditTextBox.Text = mystreamreader.ReadToEnd()
Debug.Print(mystreamreader.CurrentEncoding.CodePage) 'Autodetected encoding
mystreamreader.Close()
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 3 4 下一页