Effective way to find any file's Encoding

Anonymous (unverified), submitted on 2019-12-03 01:48:02

Question:

Yes, this is a frequently asked question, but the topic is still vague to me because I don't know much about it.

I would like a very precise way to find a file's encoding, as precise as Notepad++ is.

Answer 1:

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's encoding by analyzing its byte order mark (BOM):

    using System.IO;
    using System.Text;

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM).
    /// Defaults to ASCII when no BOM is recognized.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    public static Encoding GetEncoding(string filename)
    {
        // Read the BOM (a file shorter than 4 bytes leaves the remaining bytes as 0)
        var bom = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(bom, 0, 4);
        }

        // Analyze the BOM
        // Note: Encoding.UTF7 is marked obsolete in .NET 5 and later
        if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
        if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
        // UTF-32LE must be tested before UTF-16LE: both BOMs start with FF FE
        if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
        if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE
        return Encoding.ASCII;
    }

As a side note, you may want to modify the last line of this method to return Encoding.Default instead, so the encoding for the OS's current ANSI code page is returned by default.



Answer 2:

The following code works fine for me, using the StreamReader class:

    using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
    {
        reader.Peek(); // you need this!
        var encoding = reader.CurrentEncoding;
    }

The trick is the Peek call. Without it, .NET has not read anything yet, including the preamble (the BOM), so CurrentEncoding still reports the encoding passed to the constructor. Any other ReadXXX call made before checking the encoding works too.

If the file has no BOM, the defaultEncodingIfNoBom encoding is used. StreamReader also has constructor overloads without this parameter (in that case UTF-8 is used as the default), but I recommend stating explicitly what you consider the default encoding in your context.

I have tested this successfully with files with BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.
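
To make the behaviour above easier to try out, here is a minimal sketch that wraps the snippet in a helper and runs it against two temporary files. The class name StreamReaderDetection and the method name Detect are hypothetical and do not come from the answer; ASCII is used as the fallback only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderDetection
{
    // Hypothetical helper: returns the encoding StreamReader settles on
    // once it has looked at the start of the file.
    public static Encoding Detect(string path, Encoding fallbackIfNoBom)
    {
        using (var reader = new StreamReader(path, fallbackIfNoBom, true))
        {
            reader.Peek(); // forces StreamReader to read the preamble (BOM)
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // UTF-16 big-endian, written with a BOM (FE FF)
            File.WriteAllText(path, "hello", new UnicodeEncoding(true, true));
            Console.WriteLine(Detect(path, Encoding.ASCII).CodePage);

            // UTF-8 written without a BOM: the fallback is reported instead
            File.WriteAllText(path, "hello", new UTF8Encoding(false));
            Console.WriteLine(Detect(path, Encoding.ASCII).CodePage);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The second case shows the limit of this approach: with no BOM present, the result is simply whatever fallback was passed in, not a detection of the actual content.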



Answer 3:

I'd try the following steps:

1) Check if there is a Byte Order Mark

2) Check if the file is valid UTF8

3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII byte sequences in code pages other than UTF-8 are not valid UTF-8.
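
The three steps can be sketched roughly as follows. This is one possible reading of the answer, not its author's code: the names ThreeStepDetection, Detect and StartsWith are hypothetical, and the "ANSI" code page is passed in as a parameter because Encoding.Default does not mean the same thing on .NET Framework and on .NET Core. The UTF-8 check relies on a UTF8Encoding constructed with throwOnInvalidBytes set to true.

```csharp
using System;
using System.Text;

public static class ThreeStepDetection
{
    public static Encoding Detect(byte[] data, Encoding fallback)
    {
        // Step 1: look for a byte order mark.
        // UTF-32LE is tested before UTF-16LE because both start with FF FE.
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return new UTF8Encoding(true);
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;

        // Step 2: check whether the bytes are valid UTF-8.
        // A strict decoder throws on any invalid sequence.
        try
        {
            new UTF8Encoding(false, true).GetCharCount(data);
            return new UTF8Encoding(false);
        }
        catch (DecoderFallbackException)
        {
            // Step 3: fall back to the caller's "ANSI" code page.
            return fallback;
        }
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // "caf\u00E9 " encoded as Latin-1: the byte E9 followed by a space is not valid UTF-8
        byte[] latin1Bytes = { 0x63, 0x61, 0x66, 0xE9, 0x20 };
        Console.WriteLine(Detect(latin1Bytes, latin1).WebName);

        // The same text encoded as UTF-8 without a BOM passes step 2
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes("caf\u00E9 ");
        Console.WriteLine(Detect(utf8Bytes, latin1).WebName);
    }
}
```

Pure ASCII input passes step 2 and is reported as UTF-8, which is harmless because ASCII is a subset of UTF-8. Step 3 remains a guess: the bytes alone cannot tell one single-byte code page from another.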



Answer 4:

See the C# documentation here:

https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

    string path = @"path\to\your\file.ext";

    using (StreamReader sr = new StreamReader(path, true))
    {
        while (sr.Peek() >= 0)
        {
            Console.Write((char)sr.Read());
        }

        //Test for the encoding after reading, or at least
        //after the first read.
        Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
        Console.ReadLine();
        Console.WriteLine();
    }


Answer 5:

The following is my PowerShell code to determine whether some cpp, h, or ml files are encoded in ISO-8859-1 (Latin-1) or in UTF-8 without a BOM; if neither, it assumes GB18030. I am Chinese and work in France. MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues'.

The code is written in PowerShell but uses .NET, so it is easy to translate into C# or F#.



Answer 6:

Check this.

UDE

This is a port of Mozilla Universal Charset Detector and you can use it like this...

    public static void Main(String[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename))
        {
            Ude.CharsetDetector cdet = new Ude.CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            if (cdet.Charset != null)
            {
                Console.WriteLine("Charset: {0}, confidence: {1}",
                    cdet.Charset, cdet.Confidence);
            }
            else
            {
                Console.WriteLine("Detection failed.");
            }
        }
    }

