How can I detect the encoding/codepage of a text file

梦如初夏 2020-11-21 22:42

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the

20 answers
  •  小鲜肉 (original poster)
    2020-11-21 23:04

    If someone is looking for a 93.9% solution, this works for me:

    using System.IO;
    using System.Text;

    public static class StreamExtension
    {
        /// <summary>
        /// Convert the content to a string.
        /// </summary>
        /// <param name="stream">The stream.</param>
        /// <returns></returns>
        public static string ReadAsString(this Stream stream)
        {
            var startPosition = stream.Position;
            try
            {
                // 1. Check for a BOM
                // 2. or try UTF-8, the most used encoding (86.3%). See: http://w3techs.com/technologies/overview/character_encoding/all/
                var streamReader = new StreamReader(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true), detectEncodingFromByteOrderMarks: true);
                return streamReader.ReadToEnd();
            }
            catch (DecoderFallbackException)
            {
                stream.Position = startPosition;
    
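                // 3. The second most used encoding (6.7%) is ISO-8859-1. So use Windows-1252 (0.9%, also known as ANSI), which is a superset of the printable characters of ISO-8859-1.
                // Note: on .NET Core / .NET 5+, code page 1252 is only available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).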
                var streamReader = new StreamReader(stream, Encoding.GetEncoding(1252));
                return streamReader.ReadToEnd();
            }
        }
    }
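
    To see which branch a given input takes, here is a minimal sketch of the same idea on byte arrays. It is a simplified stand-in, not the extension method above: `FallbackDecodingDemo` and `Decode` are hypothetical names, it uses `GetString` instead of a `StreamReader` (so there is no BOM handling), and it assumes .NET Core 3.0+ / .NET 5+, where `CodePagesEncodingProvider` ships with the framework.

```csharp
using System;
using System.Text;

public static class FallbackDecodingDemo
{
    static FallbackDecodingDemo()
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where code page 1252
        // must be registered before Encoding.GetEncoding(1252) can find it.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: strict UTF-8 first, Windows-1252 if the bytes are not valid UTF-8.
    public static string Decode(byte[] bytes)
    {
        try
        {
            var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }

    public static void Main()
    {
        byte[] utf8Bytes = { 0x63, 0x61, 0x66, 0xC3, 0xA9, 0x21 }; // "café!" encoded as UTF-8
        byte[] cp1252Bytes = { 0x63, 0x61, 0x66, 0xE9, 0x21 };     // "café!" encoded as Windows-1252

        Console.WriteLine(Decode(utf8Bytes) == "caf\u00E9!");   // True (UTF-8 branch)
        Console.WriteLine(Decode(cp1252Bytes) == "caf\u00E9!"); // True (fallback branch)
    }
}
```

    Keep in mind that the fallback is a guess. A file in some other single-byte encoding will usually decode without an error too, just to the wrong characters.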
    
