Converting special characters such as ü and à back to their original, Latin alphabet counterparts in C#

感情败类 2020-12-30 02:02

I have been given an export from a MySQL database that seems to have had its encoding muddled somewhat over time and contains a mix of HTML char codes such as

5 Answers
  • 2020-12-30 02:05

    The data is only partly unrecoverable, because Windows-1252 has five unassigned byte values (0x81, 0x8D, 0x8F, 0x90 and 0x9D). Some variants of Windows-1252 fill these with control characters, but those don't survive in Stack Overflow posts. If such a variant was used, you can recover the text fully as long as you don't lose the hidden control characters when copying and pasting.

    There is also the non-breaking space character, which is usually dropped or turned into a regular space when copying and pasting, but that is not an issue when you work with the bytes directly.

    The misencoding abuse this string has gone through is:

    UTF-8 -> Windows-1252 -> UTF-8 -> Windows-1252
    

    To recover, here is an example:

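    // The input as it arrives: UTF-8 text that was misread as Windows-1252 twice
    string a = "DesinfektionslÃƒÂ¶sungstÃƒÂ¼cher fÃƒÂ¼r FlÃƒÂ¤chen";
    Encoding utf8 = Encoding.GetEncoding(65001);
    // On .NET Core / .NET 5+, register CodePagesEncodingProvider before requesting code page 1252
    Encoding win1252 = Encoding.GetEncoding(1252);
    
    // Undo each misreading: encode as Windows-1252, decode as UTF-8, and do that twice
    string result = utf8.GetString(win1252.GetBytes(utf8.GetString(win1252.GetBytes(a))));
    
    Console.WriteLine(result);
    //Desinfektionslösungstücher für Flächen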
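
When the number of misencoding layers is not known in advance, the same round trip can be repeated until it stops succeeding. The following is only a sketch under assumptions: `MojibakeRepair`, `UndoOnce` and `UndoAll` are hypothetical names, it targets .NET Core 3.0 or later (where `CodePagesEncodingProvider` is available), and it uses strict encodings that throw instead of silently substituting characters. Text that legitimately contains sequences such as "Ã¶" would also be "repaired", so the result should be checked by eye.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: undoes one "UTF-8 bytes read as Windows-1252" layer.
    // Returns null when the text cannot be the product of such a layer.
    public static string UndoOnce(string text, Encoding win1252Strict, Encoding utf8Strict)
    {
        try
        {
            return utf8Strict.GetString(win1252Strict.GetBytes(text));
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both derive from ArgumentException
            return null;
        }
    }

    // Hypothetical helper: peels off layers until the round trip fails or changes nothing.
    public static string UndoAll(string text, int maxLayers = 5)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code page 1252 must be registered first
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252Strict = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding utf8Strict = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes

        for (int i = 0; i < maxLayers; i++)
        {
            string next = UndoOnce(text, win1252Strict, utf8Strict);
            if (next == null || next == text)
            {
                break; // already clean, or not recoverable by this method
            }
            text = next;
        }
        return text;
    }
}
```

Calling `MojibakeRepair.UndoAll(a)` on the doubly misencoded string above should peel off both layers and then stop, because a third pass no longer yields valid UTF-8.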
    
  • 2020-12-30 02:12

    Well, first of all, as the data has been decoded using the wrong encoding, it's likely that some of the characters are impossible to recover. It looks like UTF-8 data that was incorrectly decoded using an 8-bit encoding.

    There is no built-in method to recover data like this, because it's not something that you normally do. There is no reliable way to decode the data, because it's already broken.

    What you can try is to encode the data and decode it using the wrong encoding again, just the other way around:

    byte[] data = Encoding.Default.GetBytes(input);
    string output = Encoding.UTF8.GetString(data);
    

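    On .NET Framework, Encoding.Default is the current ANSI encoding of your system; on .NET Core and later it is always UTF-8, so you need to name the 8-bit encoding explicitly there. You can try some different encodings and see which one gives the best result.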
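
Trying several encodings can be automated. The sketch below is illustrative only: `EncodingProbe` and `TryCodePages` are hypothetical names, the code pages in the usage note are just plausible candidates, and it assumes .NET Core 3.0 or later so that `CodePagesEncodingProvider` is available.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingProbe
{
    // Hypothetical helper: re-encodes the input with each candidate code page,
    // then decodes the resulting bytes as UTF-8 so the outcomes can be compared.
    public static Dictionary<int, string> TryCodePages(string input, params int[] codePages)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages must be registered first
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var results = new Dictionary<int, string>();
        foreach (int codePage in codePages)
        {
            byte[] data = Encoding.GetEncoding(codePage).GetBytes(input);
            results[codePage] = Encoding.UTF8.GetString(data);
        }
        return results;
    }
}
```

For example, `EncodingProbe.TryCodePages(input, 1252, 28591, 1250)` returns one candidate string per code page (Windows-1252, ISO-8859-1 and Windows-1250 here); printing each entry with `Console.WriteLine` makes it easy to pick the one that reads correctly.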

  • 2020-12-30 02:24

    Here you can find a more complete list:

    http://bueltge.de/wp-content/download/wk/utf-8_kodierungen.pdf

  • 2020-12-30 02:24

    I've been troubled by this char problem before. Solution:

    My .(cs)html file was UTF-8; I converted to UTF-8Y (UTF-8 with a BOM).

  • 2020-12-30 02:29

    It's probably a UTF-8 encoded string that was read as Windows-1252.

    As Guffa mentioned, the data has been corrupted.

    Let's take a look at the bytes:
    ö -> C3 B6 in UTF-8

    In Windows-1252, C3 -> Ã and B6 -> ¶

    so ö -> Ã¶

    What about all these "ƒÂ"?

    ƒ -> 83, Â -> C2

    Honestly, I don't know why they appear, but you can try erasing them and then doing the conversions Guffa mentioned. Good luck.
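
The byte mapping above can be checked in code, and applying the same misreading a second time is one plausible explanation for the "ƒÂ" pairs. This is a small sketch with hypothetical names (`ByteInspector`, `Utf8Hex`, `Misread`), assuming .NET Core 3.0 or later for `CodePagesEncodingProvider`.

```csharp
using System;
using System.Text;

public static class ByteInspector
{
    // Hypothetical helper: hex dump of the UTF-8 bytes of a string, e.g. "C3-B6" for "ö"
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    // Hypothetical helper: reproduces one layer of damage by decoding
    // the UTF-8 bytes of the text as if they were Windows-1252.
    public static string Misread(string text)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code page 1252 must be registered first
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);
        return win1252.GetString(Encoding.UTF8.GetBytes(text));
    }
}
```

`ByteInspector.Misread("ö")` gives "Ã¶", and misreading that result once more gives "ÃƒÂ¶": the UTF-8 bytes of "Ã" are C3 83 and those of "¶" are C2 B6, and 83 and C2 are exactly the "ƒ" and "Â" listed above.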
