Determine a string's encoding in C#

前端未结

关注

 9  1941

小鲜肉

Is there any way to determine a string\'s encoding in C#?

Say, I have a filename string, but I don\'t know if it is encoded in ~~Unicode~~ UTF-16 or the

相关标签:

9条回答

长发绾君心

2020-11-22 15:48

Another option, very late in coming, sorry:

http://www.architectshack.com/TextFileEncodingDetector.ashx

This small C#-only class uses BOMS if present, tries to auto-detect possible unicode encodings otherwise, and falls back if none of the Unicode encodings is possible or likely.

It sounds like UTF8Checker referenced above does something similar, but I think this is slightly broader in scope - instead of just UTF8, it also checks for other possible Unicode encodings (UTF-16 LE or BE) that might be missing a BOM.

Hope this helps someone!

0 讨论(0)
发布评论:

提交评论
- 加载中...
情歌与酒

2020-11-22 15:52
My solution is to use built-in stuffs with some fallbacks.

I picked the strategy from an answer to another similar question on stackoverflow but I can't find it now.

It checks the BOM first using the built-in logic in StreamReader, if there's BOM, the encoding will be something other than Encoding.Default, and we should trust that result.

If not, it checks whether the bytes sequence is valid UTF-8 sequence. if it is, it will guess UTF-8 as the encoding, and if not, again, the default ASCII encoding will be the result.
```
static Encoding getEncoding(string path) {
    var stream = new FileStream(path, FileMode.Open);
    var reader = new StreamReader(stream, Encoding.Default, true);
    reader.Read();

    if (reader.CurrentEncoding != Encoding.Default) {
        reader.Close();
        return reader.CurrentEncoding;
    }

    stream.Position = 0;

    reader = new StreamReader(stream, new UTF8Encoding(false, true));
    try {
        reader.ReadToEnd();
        reader.Close();
        return Encoding.UTF8;
    }
    catch (Exception) {
        reader.Close();
        return Encoding.Default;
    }
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
生来不讨喜

2020-11-22 15:57

It depends where the string 'came from'. A .NET string is Unicode (UTF-16). The only way it could be different if you, say, read the data from a database into a byte array.

This CodeProject article might be of interest: Detect Encoding for in- and outgoing text

Jon Skeet's Strings in C# and .NET is an excellent explanation of .NET strings.

0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2