Read UTF8/UNICODE characters from an escaped ASCII sequence

春和景丽 2021-01-27 18:24

I have the following name in a file, with its non-ASCII characters written as octal escape sequences, and I need to read it as a UTF-8 encoded string. The name looks like this:

test_\\303\\246\\303\\270\\303\\245.txt
1 answer
  • 2021-01-27 18:45

    Assuming you have this string:

    string input = "test_\\303\\246\\303\\270\\303\\245.txt";
    

    i.e. literally:

    test_\303\246\303\270\303\245.txt
    

    You could do this:

    // Requires: using System; using System.Text; using System.Text.RegularExpressions;
    string input = "test_\\303\\246\\303\\270\\303\\245.txt";
    Encoding iso88591 = Encoding.GetEncoding(28591); // See the note at the end of the answer
    Encoding utf8 = Encoding.UTF8;
    
    
    // Turn the octal escape sequences into characters with code points 0-255;
    // this results in a "binary string". The first digit is limited to 0-3 so
    // that the value can never exceed 255 (octal 377).
    string binaryString = Regex.Replace(input, @"\\(?<num>[0-3][0-7]{2})", delegate(Match m)
    {
        string oct = m.Groups["num"].Value;
        return Char.ConvertFromUtf32(Convert.ToInt32(oct, 8));
    });
    
    //Turn the "binary string" into bytes
    byte[] raw = iso88591.GetBytes(binaryString);
    
    //Read the bytes into C# string
    string output = utf8.GetString(raw);
    Console.WriteLine(output);
    //test_æøå.txt
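
    The same decoding can also be sketched without the intermediate string, by collecting the bytes directly. This is only an illustration under a few assumptions: `OctalEscapes` and `Decode` are hypothetical names, an escape is taken to be a backslash followed by exactly three octal digits with a value of at most 377, and any text outside an escape is encoded as UTF-8 unchanged.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    static class OctalEscapes
    {
        // True if the three characters starting at 'start' form an octal
        // number in the range 000-377 (that is, a single byte).
        static bool IsOctalByte(string s, int start)
        {
            return s[start] >= '0' && s[start] <= '3'
                && s[start + 1] >= '0' && s[start + 1] <= '7'
                && s[start + 2] >= '0' && s[start + 2] <= '7';
        }

        // Hypothetical helper: replaces each "\ooo" escape with the byte it
        // names, then decodes the whole byte sequence as UTF-8.
        public static string Decode(string input)
        {
            var bytes = new List<byte>(input.Length);
            int i = 0;
            while (i < input.Length)
            {
                if (input[i] == '\\' && i + 3 < input.Length && IsOctalByte(input, i + 1))
                {
                    bytes.Add(Convert.ToByte(input.Substring(i + 1, 3), 8));
                    i += 4;
                }
                else
                {
                    // Keep surrogate pairs together so they encode correctly.
                    int len = char.IsHighSurrogate(input[i])
                        && i + 1 < input.Length
                        && char.IsLowSurrogate(input[i + 1]) ? 2 : 1;
                    bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, len)));
                    i += len;
                }
            }
            return Encoding.UTF8.GetString(bytes.ToArray());
        }

        static void Main()
        {
            Console.WriteLine(Decode("test_\\303\\246\\303\\270\\303\\245.txt"));
            // test_æøå.txt
        }
    }
    ```

    Compared with the regex version, this avoids the ISO-8859-1 step entirely, at the cost of a hand-written loop.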
    

    By "binary string" I mean a string consisting only of characters with code points 0-255. It amounts to a poor man's byte[]: you read the code point of the character at index i instead of the byte value at index i of a byte[] (this is what we did in JavaScript a few years ago). Because ISO-8859-1 maps exactly the first 256 Unicode code points to a single byte each, it is perfect for converting a "binary string" into a byte[].
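
    The claim about ISO-8859-1 can be checked with a small round trip. The sketch below is only a sanity check, and `Latin1RoundTrip` and `CodePointsMatchBytes` are hypothetical names; it encodes the characters U+0000 to U+00FF with code page 28591 and verifies that each one becomes the byte with the same numeric value.

    ```csharp
    using System;
    using System.Text;

    static class Latin1RoundTrip
    {
        // Returns true if every code point 0-255 encodes to the single
        // byte with the same value under ISO-8859-1 (code page 28591).
        public static bool CodePointsMatchBytes()
        {
            Encoding iso88591 = Encoding.GetEncoding(28591);

            var chars = new char[256];
            for (int i = 0; i < 256; i++)
            {
                chars[i] = (char)i;
            }

            byte[] bytes = iso88591.GetBytes(chars);
            if (bytes.Length != 256)
            {
                return false;
            }

            for (int i = 0; i < 256; i++)
            {
                if (bytes[i] != i)
                {
                    return false;
                }
            }
            return true;
        }

        static void Main()
        {
            Console.WriteLine(CodePointsMatchBytes());
            // True
        }
    }
    ```

    Characters above U+00FF have no ISO-8859-1 byte, which is why the "binary string" must contain only code points 0-255 before it is converted.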
