Yes, this is a frequently asked question, and the matter is vague to me since I don't know much about it.
But I would like a very precise way to find a file's encoding, as precise as Notepad++ is.
The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM):
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE (must be checked before UTF-16LE)
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode;                             // UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode;                    // UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE
    return Encoding.ASCII;
}
As a side note, you may want to modify the last line of this method to return Encoding.Default
instead, so the encoding for the OS's current ANSI code page is returned by default.
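For illustration only (not part of the original answer), here is a minimal, hedged usage sketch that calls GetEncoding and then reads the file with the detected encoding. The file path is a placeholder, and the method is assumed to be defined in the same class:

using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    // GetEncoding(...) from above is assumed to be defined in this class.

    static void Main()
    {
        string path = @"C:\temp\sample.txt";       // hypothetical file path
        Encoding encoding = GetEncoding(path);     // BOM-based detection shown above
        Console.WriteLine("Detected encoding: {0}", encoding.EncodingName);

        // Read the file using the detected encoding
        string text = File.ReadAllText(path, encoding);
        Console.WriteLine(text);
    }
}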
The following code works fine for me, using the StreamReader
class:
using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
{
    reader.Peek(); // you need this!
    var encoding = reader.CurrentEncoding;
}
The trick is to use the Peek call; otherwise, .NET has done nothing yet (it hasn't read the preamble, i.e. the BOM). Of course, any other ReadXXX call made before checking the encoding works too.
If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this parameter (in that case, the OS's default ANSI encoding is used as defaultEncodingIfNoBom), but I recommend defining what you consider the default encoding in your context.
I have tested this successfully with files with BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.
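For example (my own hedged sketch, not from the original answer), you could pick ISO-8859-1 as your explicit fallback and print what the reader actually detected; the file name is a placeholder:

using System;
using System.IO;
using System.Text;

class PeekDetectionDemo
{
    static void Main()
    {
        string fileName = @"C:\temp\sample.txt";                              // hypothetical file
        Encoding defaultEncodingIfNoBom = Encoding.GetEncoding("ISO-8859-1"); // explicit fallback

        using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
        {
            reader.Peek(); // forces the BOM (if any) to be read
            Console.WriteLine("Encoding in use: {0}", reader.CurrentEncoding.EncodingName);
        }
    }
}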
I'd try the following steps:
1) Check if there is a Byte Order Mark
2) Check if the file is valid UTF8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)
Step 2 works because most non-ASCII byte sequences in code pages other than UTF-8 are not valid UTF-8.
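Here is a minimal C# sketch of that strategy (my own illustration; GuessEncoding is a hypothetical helper name). Step 2 is implemented by decoding the bytes with a strict UTF8Encoding that throws on invalid sequences:

using System;
using System.IO;
using System.Text;

static class SimpleEncodingGuess
{
    // Hypothetical helper: BOM check, then strict UTF-8 validation, then ANSI fallback.
    public static Encoding GuessEncoding(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // 1) Check for a BOM (UTF-8 shown here; extend with UTF-16/UTF-32 as needed).
        if (bytes.Length >= 3 && bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf)
            return Encoding.UTF8;

        // 2) Check whether the whole file is valid UTF-8.
        try
        {
            new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true)
                .GetString(bytes);
            return new UTF8Encoding(false); // valid UTF-8 without BOM (or plain ASCII)
        }
        catch (DecoderFallbackException)
        {
            // 3) Not valid UTF-8: fall back to the local "ANSI" code page.
            //    (Encoding.Default is the OS ANSI code page on .NET Framework;
            //     on .NET Core it is UTF-8, so you may prefer an explicit code page.)
            return Encoding.Default;
        }
    }
}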
Look here for C#:
https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx
string path = @"path\to\your\file.ext"; using (StreamReader sr = new StreamReader(path, true)) { while (sr.Peek() >= 0) { Console.Write((char)sr.Read()); } //Test for the encoding after reading, or at least //after the first read. Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding); Console.ReadLine(); Console.WriteLine(); }
The following is my PowerShell code to determine whether some .cpp, .h, or .ml files are encoded in ISO-8859-1 (Latin-1) or UTF-8 without BOM; if neither, it assumes GB18030. I am Chinese and work in France, and MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues'.
The code is written in PowerShell, but it uses .NET, so it is easy to translate into C# or F#.
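The PowerShell script itself is not reproduced here. As a rough, hedged C# sketch of the same idea (my own illustration, not the author's code, and the heuristic is deliberately crude), one could try a strict UTF-8 decode first and otherwise distinguish Latin-1 from GB18030 by looking for runs of consecutive high bytes:

using System;
using System.IO;
using System.Text;

static class CppSourceEncodingGuess
{
    // Hedged sketch only: guesses among UTF-8 (no BOM), Latin-1 and GB18030,
    // in the spirit of the PowerShell script described above.
    public static Encoding GuessSourceEncoding(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        // 1) Valid UTF-8? (a strict decoder throws on invalid sequences)
        try
        {
            new UTF8Encoding(false, throwOnInvalidBytes: true).GetString(bytes);
            return new UTF8Encoding(false);              // UTF-8 without BOM (or plain ASCII)
        }
        catch (DecoderFallbackException) { }

        // 2) Crude Latin-1 vs GB18030 heuristic: GB-encoded Chinese text comes in
        //    runs of consecutive high bytes, while accented Latin-1 letters in
        //    French source files are usually isolated between ASCII bytes.
        for (int i = 0; i < bytes.Length - 1; i++)
        {
            if (bytes[i] >= 0x80 && bytes[i + 1] >= 0x80)
                return Encoding.GetEncoding("GB18030");  // looks like multi-byte CJK
                // (GB18030 may require the CodePagesEncodingProvider on .NET Core)
        }

        return Encoding.GetEncoding("ISO-8859-1");       // assume Latin-1
    }
}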
Check this.
This is a port of Mozilla Universal Charset Detector and you can use it like this...
public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename))
    {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null)
        {
            Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
        }
        else
        {
            Console.WriteLine("Detection failed.");
        }
    }
}
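As a follow-up (my own hedged sketch, not from the original answer), the detected charset name can usually be passed to Encoding.GetEncoding to actually read the file, though not every name Ude reports is guaranteed to map to a .NET encoding:

// Hedged follow-up sketch: turn Ude's result into a .NET Encoding and read the file.
string filename = @"C:\temp\sample.txt";   // hypothetical file
using (FileStream fs = File.OpenRead(filename))
{
    var cdet = new Ude.CharsetDetector();
    cdet.Feed(fs);
    cdet.DataEnd();

    if (cdet.Charset != null)
    {
        // May throw if the reported name has no matching .NET encoding.
        Encoding encoding = Encoding.GetEncoding(cdet.Charset);
        string text = File.ReadAllText(filename, encoding);
        Console.WriteLine(text);
    }
}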