Determine a string's encoding in C#

小鲜肉 2020-11-22 14:54

Is there any way to determine a string's encoding in C#?

Say, I have a filename string, but I don't know if it is encoded in Unicode UTF-16 or the

9 Answers
  • 2020-11-22 15:48

    Another option, very late in coming, sorry:

    http://www.architectshack.com/TextFileEncodingDetector.ashx

    This small C#-only class uses BOMs if present, otherwise tries to auto-detect the possible Unicode encodings, and falls back to a default if none of the Unicode encodings is possible or likely.

    It sounds like the UTF8Checker referenced above does something similar, but I think this is slightly broader in scope: instead of just UTF-8, it also checks for other possible Unicode encodings (UTF-16 LE or BE) that might be missing a BOM.
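
    The BOM step described above can be sketched roughly as follows. This is not the code of the linked class; BomSniffer and DetectFromBom are hypothetical names used only for illustration, and the sketch covers only the BOM check, not the heuristics for BOM-less data.

```csharp
using System;
using System.Text;

// Hypothetical helper, not taken from TextFileEncodingDetector.
static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM, or null if no BOM is recognized.
    public static Encoding DetectFromBom(byte[] b)
    {
        if (b == null) throw new ArgumentNullException(nameof(b));

        // UTF-32 must be tested before UTF-16, because the UTF-32 LE BOM
        // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
        // Caveat: UTF-16 LE text whose first character is U+0000 looks identical.
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return new UTF32Encoding(false, true);   // UTF-32 LE
        if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return new UTF32Encoding(true, true);    // UTF-32 BE
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return Encoding.UTF8;                    // UTF-8
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return Encoding.Unicode;                 // UTF-16 LE
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return Encoding.BigEndianUnicode;        // UTF-16 BE

        return null; // no BOM: the caller has to fall back to heuristics or a default
    }
}
```

    A null result only means that no BOM was found; plenty of UTF-8 and UTF-16 files are written without one, which is why a detector needs the heuristic step as well.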

    Hope this helps someone!

  • 2020-11-22 15:52

    My solution is to use the built-in functionality with some fallbacks.

    I picked up the strategy from an answer to a similar question on Stack Overflow, but I can't find it now.

    It checks for a BOM first, using the built-in logic in StreamReader. If there is a BOM, the detected encoding will be something other than Encoding.Default, and we should trust that result.

    If not, it checks whether the byte sequence is valid UTF-8. If it is, it guesses UTF-8 as the encoding; if not, the result is again Encoding.Default. Note that Encoding.Default is not ASCII: on .NET Framework it is the system's ANSI code page, and on .NET Core and later it is UTF-8.

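    using System.IO;
    using System.Text;

    static Encoding getEncoding(string path) {
        // Open read-only; the using blocks release the file on every path.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read)) {
            // Pass 1: let StreamReader look for a BOM (leaveOpen keeps the stream usable).
            using (var reader = new StreamReader(stream, Encoding.Default, true, 1024, true)) {
                reader.Read();
                if (reader.CurrentEncoding != Encoding.Default) {
                    return reader.CurrentEncoding;
                }
            }

            stream.Position = 0;

            // Pass 2: strict UTF-8 decoder (emits no BOM, throws on invalid bytes).
            using (var reader = new StreamReader(stream, new UTF8Encoding(false, true))) {
                try {
                    reader.ReadToEnd();
                    return Encoding.UTF8;
                }
                catch (DecoderFallbackException) {
                    // Not valid UTF-8: fall back to the default encoding.
                    return Encoding.Default;
                }
            }
        }
    }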
    
  • 2020-11-22 15:57

    It depends on where the string came from. A .NET string is always Unicode (UTF-16). The only way the encoding could be different is if you, say, read the data from a database into a byte array.
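
    To illustrate the point, here is a small sketch showing that the encoding question only arises at the boundary between bytes and strings. StringEncodingDemo and RoundTrip are hypothetical names, and the ISO-8859-1 decode is only an assumed example of picking the wrong encoding.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
static class StringEncodingDemo
{
    // Encodes text to bytes with one encoding, then decodes those bytes with another.
    // The string itself is UTF-16 in memory the whole time; only the bytes differ.
    public static string RoundTrip(string text, Encoding encodeWith, Encoding decodeWith)
    {
        byte[] bytes = encodeWith.GetBytes(text);
        return decodeWith.GetString(bytes);
    }

    // Number of bytes the same string occupies under a given encoding.
    public static int ByteLength(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }
}
```

    For the five-character string "h\u00E9llo", ByteLength returns 6 with Encoding.UTF8 and 10 with Encoding.Unicode. Decoding the UTF-8 bytes with Encoding.GetEncoding("ISO-8859-1") yields the garbled "h\u00C3\u00A9llo", which is what a wrong guess about the source bytes looks like.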

    This CodeProject article might be of interest: "Detect Encoding for in- and outgoing text".

    Jon Skeet's "Strings in C# and .NET" is an excellent explanation of .NET strings.
