How can I detect the encoding/codepage of a text file

梦如初夏 2020-11-21 22:42

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the

20 answers
  •  醉酒成梦
    2020-11-21 23:06

    While looking for a different solution, I found that

    https://code.google.com/p/ude/

    is rather heavy for my needs.

    I only needed basic encoding detection, based on the first 4 bytes and possibly an XML charset declaration, so I took some sample source code from the internet and added a slightly modified version of

    http://lists.w3.org/Archives/Public/www-validator/2002Aug/0084.html

    which was originally written for Java.

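        // Requires: using System; using System.Text; using System.Text.RegularExpressions;
        public static Encoding DetectEncoding(byte[] fileContent)
        {
            if (fileContent == null)
                throw new ArgumentNullException(nameof(fileContent));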
    
            if (fileContent.Length < 2)
                return Encoding.ASCII;      // Default fallback
    
            if (fileContent[0] == 0xff
                && fileContent[1] == 0xfe
                && (fileContent.Length < 4
                    || fileContent[2] != 0
                    || fileContent[3] != 0
                    )
                )
                return Encoding.Unicode;
    
            if (fileContent[0] == 0xfe
                && fileContent[1] == 0xff
                )
                return Encoding.BigEndianUnicode;
    
            if (fileContent.Length < 3)
                return Encoding.ASCII;      // Too short for the remaining checks; same default fallback
    
            if (fileContent[0] == 0xef && fileContent[1] == 0xbb && fileContent[2] == 0xbf)
                return Encoding.UTF8;
    
            // UTF-7 BOM (2B 2F 76); note that Encoding.UTF7 is marked obsolete in .NET 5 and later.
            if (fileContent[0] == 0x2b && fileContent[1] == 0x2f && fileContent[2] == 0x76)
                return Encoding.UTF7;
    
            if (fileContent.Length < 4)
                return Encoding.ASCII;      // Too short for the remaining checks; same default fallback
    
            if (fileContent[0] == 0xff && fileContent[1] == 0xfe && fileContent[2] == 0 && fileContent[3] == 0)
                return Encoding.UTF32;
    
            if (fileContent[0] == 0 && fileContent[1] == 0 && fileContent[2] == 0xfe && fileContent[3] == 0xff)
                return Encoding.GetEncoding(12001);
    
            // Probe at most the first 128 bytes for an XML declaration.
            int len = Math.Min(fileContent.Length, 128);
            string probe = Encoding.ASCII.GetString(fileContent, 0, len);
    
            // Allow optional whitespace (including spaces) on both sides of '=' in encoding="...".
            MatchCollection mc = Regex.Matches(probe, "^<\\?xml[^<>]*encoding[ \\t\\n\\r]*=[ \\t\\n\\r]*['\"]([A-Za-z]([A-Za-z0-9._]|-)*)", RegexOptions.Singleline);
            // Add '[0].Groups[1].Value' to the end to test regex
    
            if( mc.Count == 1 && mc[0].Groups.Count >= 2 )
            {
                // Typically picks up 'UTF-8' string
                Encoding enc = null;
    
                try {
                    enc = Encoding.GetEncoding( mc[0].Groups[1].Value );
                } catch (ArgumentException) {
                    // Unknown or unsupported encoding name; fall through to the default.
                }
    
                if( enc != null )
                    return enc;
            }
    
            return Encoding.ASCII;      // Default fallback
        }
    

    It's probably enough to read the first 1024 bytes of the file, but I'm loading the whole file.
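
    For anyone who prefers not to load the whole file, here is a minimal sketch of reading only the beginning of it. The class name `FilePrefix`, the method name `ReadPrefix`, and the default of 1024 bytes are my own assumptions for illustration; they are not part of the code above.

    ```csharp
    using System;
    using System.IO;

    public static class FilePrefix
    {
        // Hypothetical helper: reads at most maxBytes from the start of the
        // file instead of loading the whole file into memory.
        public static byte[] ReadPrefix(string path, int maxBytes = 1024)
        {
            using (FileStream stream = File.OpenRead(path))
            {
                byte[] buffer = new byte[maxBytes];
                int total = 0;

                // Stream.Read may return fewer bytes than requested, so loop
                // until the buffer is full or the end of the file is reached.
                while (total < maxBytes)
                {
                    int read = stream.Read(buffer, total, maxBytes - total);
                    if (read == 0)
                        break;
                    total += read;
                }

                // Trim the buffer when the file is shorter than maxBytes.
                if (total < maxBytes)
                    Array.Resize(ref buffer, total);

                return buffer;
            }
        }
    }
    ```

    The result can then be passed straight to the method above, e.g. `DetectEncoding(FilePrefix.ReadPrefix(path))`. Since that method only looks at the first 4 bytes and probes at most 128 bytes for the XML declaration, a 1024-byte prefix should give the same answer as the full file.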
