How can I detect the encoding/codepage of a text file

梦如初夏 2020-11-21 22:42

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because they were created in a different/unknown codepage.

20 Answers
  • 2020-11-21 23:04

    If someone is looking for a 93.9% solution, this works for me (a usage sketch follows the class):

    public static class StreamExtension
    {
        /// <summary>
        /// Convert the content to a string.
        /// </summary>
        /// <param name="stream">The stream.</param>
        /// <returns></returns>
        public static string ReadAsString(this Stream stream)
        {
            var startPosition = stream.Position;
            try
            {
                // 1. Check for a BOM
                // 2. or try with UTF-8, the most used encoding (86.3%). Visit: http://w3techs.com/technologies/overview/character_encoding/all/
                var streamReader = new StreamReader(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true), detectEncodingFromByteOrderMarks: true);
                return streamReader.ReadToEnd();
            }
            catch (DecoderFallbackException ex)
            {
                stream.Position = startPosition;
    
                // 3. The second most used encoding (6.7%) is ISO-8859-1, so fall back to Windows-1252 (0.9%, also known as ANSI), which is a superset of ISO-8859-1.
                var streamReader = new StreamReader(stream, Encoding.GetEncoding(1252));
                return streamReader.ReadToEnd();
            }
        }
    }
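
    A rough usage sketch (the path is an example; note that the stream must be seekable, since the Windows-1252 fallback rewinds to the starting position):

        using System.IO;

        string text;
        using (var stream = File.OpenRead(@"C:\data\input.csv"))   // example path; any seekable stream works
        {
            text = stream.ReadAsString();
        }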
    
  • 2020-11-21 23:05

    You can't detect the codepage

    This is clearly false. Every web browser has some kind of universal charset detector to deal with pages that carry no indication whatsoever of their encoding. Firefox has one; you can download the code and see how it does it, and there is documentation available as well. Basically, it is a heuristic, but one that works really well.

    Given a reasonable amount of text, it is even possible to detect the language.

  • 2020-11-21 23:06

    If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (same link, with better formatting via Wayback Machine).
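
    If you would rather not write the heuristics yourself, there is UDE, a C# port of Mozilla's detector (linked in the next answer). A minimal sketch, assuming the Feed/DataEnd API from the project's documentation and a hypothetical file name:

        using System;
        using System.IO;
        using Ude;   // C# port of Mozilla's universal charset detector

        class Detect
        {
            static void Main()
            {
                using (var fs = File.OpenRead("sample.txt"))   // hypothetical file
                {
                    ICharsetDetector detector = new CharsetDetector();
                    detector.Feed(fs);
                    detector.DataEnd();

                    if (detector.Charset != null)
                        Console.WriteLine("Charset: {0}, confidence: {1}",
                                          detector.Charset, detector.Confidence);
                    else
                        Console.WriteLine("Detection failed.");
                }
            }
        }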

  • 2020-11-21 23:06

    Looking for a different solution, I found that

    https://code.google.com/p/ude/

    is kinda heavy. I needed some basic encoding detection, based on the first 4 bytes plus an optional XML charset declaration, so I took some sample source code from the internet and added a slightly modified version of

    http://lists.w3.org/Archives/Public/www-validator/2002Aug/0084.html

    which was written for Java.

        public static Encoding DetectEncoding(byte[] fileContent)
        {
            if (fileContent == null)
                throw new ArgumentNullException(nameof(fileContent));

            if (fileContent.Length < 2)
                return Encoding.ASCII;      // Default fallback

            // UTF-16 LE BOM (FF FE), unless it is actually a UTF-32 LE BOM (FF FE 00 00)
            if (fileContent[0] == 0xff
                && fileContent[1] == 0xfe
                && (fileContent.Length < 4
                    || fileContent[2] != 0
                    || fileContent[3] != 0
                    )
                )
                return Encoding.Unicode;

            // UTF-16 BE BOM (FE FF)
            if (fileContent[0] == 0xfe
                && fileContent[1] == 0xff
                )
                return Encoding.BigEndianUnicode;

            if (fileContent.Length < 3)
                return null;                // Too short for the remaining BOMs; caller must handle null

            // UTF-8 BOM (EF BB BF)
            if (fileContent[0] == 0xef && fileContent[1] == 0xbb && fileContent[2] == 0xbf)
                return Encoding.UTF8;

            // UTF-7 BOM (2B 2F 76)
            if (fileContent[0] == 0x2b && fileContent[1] == 0x2f && fileContent[2] == 0x76)
                return Encoding.UTF7;

            if (fileContent.Length < 4)
                return null;                // Too short for the 4-byte BOMs; caller must handle null

            // UTF-32 LE BOM (FF FE 00 00)
            if (fileContent[0] == 0xff && fileContent[1] == 0xfe && fileContent[2] == 0 && fileContent[3] == 0)
                return Encoding.UTF32;

            // UTF-32 BE BOM (00 00 FE FF); code page 12001 is UTF-32 big-endian
            if (fileContent[0] == 0 && fileContent[1] == 0 && fileContent[2] == 0xfe && fileContent[3] == 0xff)
                return Encoding.GetEncoding(12001);

            // No BOM found: probe the first 128 bytes for an XML encoding declaration
            int len = Math.Min(fileContent.Length, 128);
            string probe = Encoding.ASCII.GetString(fileContent, 0, len);

            MatchCollection mc = Regex.Matches(probe, "^<\\?xml[^<>]*encoding[ \\t\\n\\r]?=[\\t\\n\\r]?['\"]([A-Za-z]([A-Za-z0-9._]|-)*)", RegexOptions.Singleline);
            // Add '[0].Groups[1].Value' to the end to test the regex

            if (mc.Count == 1 && mc[0].Groups.Count >= 2)
            {
                // Typically picks up a string such as 'UTF-8'
                Encoding enc = null;

                try
                {
                    enc = Encoding.GetEncoding(mc[0].Groups[1].Value);
                }
                catch (Exception) { }       // Unknown charset name: ignore and fall through

                if (enc != null)
                    return enc;
            }

            return Encoding.ASCII;      // Default fallback
        }
    

    It would probably be enough to read just the first 1024 bytes of the file, but I'm loading the whole file.
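
    A hypothetical caller along those lines, probing only the first kilobyte (the file name is an example, and falling back to UTF-8 on a null result is purely illustrative):

        using System;
        using System.IO;
        using System.Text;

        byte[] buffer = new byte[1024];
        int read;
        using (var fs = File.OpenRead("input.xml"))   // example path
            read = fs.Read(buffer, 0, buffer.Length);

        byte[] probe = new byte[read];
        Array.Copy(buffer, probe, read);

        // DetectEncoding returns null when the buffer is too short for the 3- and 4-byte BOM checks.
        Encoding enc = DetectEncoding(probe) ?? Encoding.UTF8;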

  • 2020-11-21 23:09

    Ten years (!) have passed since this was asked, and I still see no mention of MS's good, non-GPL'ed solution: the IMultiLanguage2 API.

    Most of the libraries already mentioned are based on Mozilla's UDE - and it seems reasonable that browsers have already tackled similar problems. I don't know what Chrome's solution is, but since IE 5.0 MS has shipped theirs, and it is:

    1. Free of GPL-and-the-like licensing issues,
    2. Backed and maintained probably forever,
    3. Gives rich output - all valid candidates for encoding/codepages along with confidence scores,
    4. Surprisingly easy to use (it is a single function call).

    It is a native COM call, but here's some very nice work by Carsten Zeumer that handles the interop mess for .NET usage. There are some others around, but by and large this library doesn't get the attention it deserves.

  • 2020-11-21 23:10

    I use this code to detect the Unicode and Windows default ANSI codepages when reading a file. For other encodings, a check of the content is necessary, manually or programmatically. The text can then be saved with the same encoding it was opened with. (I use VB.NET; a C# equivalent is sketched after the snippet.)

    'Works for Default and Unicode (auto-detect)
    Dim mystreamreader As New StreamReader(LocalFileName, Encoding.Default) 
    MyEditTextBox.Text = mystreamreader.ReadToEnd()
    Debug.Print(mystreamreader.CurrentEncoding.CodePage) 'Autodetected encoding
    mystreamreader.Close()
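
    A C# equivalent, as a rough sketch (the file name is hypothetical; note that CurrentEncoding only reflects the detected encoding after the stream has actually been read):

        using System.IO;
        using System.Text;

        string text;
        Encoding detected;

        // Read with BOM auto-detection, falling back to the system ANSI code page.
        using (var reader = new StreamReader("input.txt", Encoding.Default, detectEncodingFromByteOrderMarks: true))
        {
            text = reader.ReadToEnd();          // CurrentEncoding settles once data has been read
            detected = reader.CurrentEncoding;
        }

        // Save the (possibly edited) text back with the same encoding it was opened with.
        using (var writer = new StreamWriter("input.txt", append: false, encoding: detected))
        {
            writer.Write(text);
        }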
    