Is there any way to determine a string\'s encoding in C#?
Say, I have a filename string, but I don\'t know if it is encoded in Unicode UTF-16 or the
Another option, very late in coming, sorry:
http://www.architectshack.com/TextFileEncodingDetector.ashx
This small C#-only class uses BOMS if present, tries to auto-detect possible unicode encodings otherwise, and falls back if none of the Unicode encodings is possible or likely.
It sounds like UTF8Checker referenced above does something similar, but I think this is slightly broader in scope - instead of just UTF8, it also checks for other possible Unicode encodings (UTF-16 LE or BE) that might be missing a BOM.
Hope this helps someone!
My solution is to use built-in stuffs with some fallbacks.
I picked the strategy from an answer to another similar question on stackoverflow but I can't find it now.
It checks the BOM first using the built-in logic in StreamReader, if there's BOM, the encoding will be something other than Encoding.Default
, and we should trust that result.
If not, it checks whether the bytes sequence is valid UTF-8 sequence. if it is, it will guess UTF-8 as the encoding, and if not, again, the default ASCII encoding will be the result.
static Encoding getEncoding(string path) {
var stream = new FileStream(path, FileMode.Open);
var reader = new StreamReader(stream, Encoding.Default, true);
reader.Read();
if (reader.CurrentEncoding != Encoding.Default) {
reader.Close();
return reader.CurrentEncoding;
}
stream.Position = 0;
reader = new StreamReader(stream, new UTF8Encoding(false, true));
try {
reader.ReadToEnd();
reader.Close();
return Encoding.UTF8;
}
catch (Exception) {
reader.Close();
return Encoding.Default;
}
}
It depends where the string 'came from'. A .NET string is Unicode (UTF-16). The only way it could be different if you, say, read the data from a database into a byte array.
This CodeProject article might be of interest: Detect Encoding for in- and outgoing text
Jon Skeet's Strings in C# and .NET is an excellent explanation of .NET strings.