I have the following name in a file and I need to read it as a UTF-8 encoded string:
test_\\303\\246\\303\\270\\303\\245.txt
Assuming you have this string:
string input = "test_\\303\\246\\303\\270\\303\\245.txt";
i.e. literally:
test_\303\246\303\270\303\245.txt
You could do this:
using System;
using System.Text;
using System.Text.RegularExpressions;

string input = "test_\\303\\246\\303\\270\\303\\245.txt";
Encoding iso88591 = Encoding.GetEncoding(28591); // See note at the end of answer
Encoding utf8 = Encoding.UTF8;

// Turn the octal escape sequences into characters having code points 0-255;
// this results in a "binary string"
string binaryString = Regex.Replace(input, @"\\(?<num>[0-7]{3})", delegate(Match m)
{
    string oct = m.Groups["num"].Value;
    return Char.ConvertFromUtf32(Convert.ToInt32(oct, 8));
});

// Turn the "binary string" into bytes
byte[] raw = iso88591.GetBytes(binaryString);

// Read the bytes into a C# string
string output = utf8.GetString(raw);
Console.WriteLine(output);
// test_æøå.txt
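If the intermediate string feels indirect, the same decoding could probably be done by collecting the bytes directly. The following is only a sketch of that variant, not the code from the answer above: `DecodeOctalEscapes` and `IsOctal` are hypothetical helper names, and it assumes that every `\NNN` escape with a value of 255 or less stands for one raw byte and that everything else should be copied through unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: true for the characters '0' through '7'.
static bool IsOctal(char c) => c >= '0' && c <= '7';

// Hypothetical helper: turns \NNN octal escapes into raw bytes,
// then reads the whole byte sequence as UTF-8.
static string DecodeOctalEscapes(string input)
{
    var bytes = new List<byte>();
    int i = 0;
    while (i < input.Length)
    {
        // A backslash followed by three octal digits encodes one byte.
        if (input[i] == '\\' && i + 3 < input.Length
            && IsOctal(input[i + 1]) && IsOctal(input[i + 2]) && IsOctal(input[i + 3]))
        {
            int value = Convert.ToInt32(input.Substring(i + 1, 3), 8);
            if (value <= 255)
            {
                bytes.Add((byte)value);
                i += 4;
                continue;
            }
        }

        // Anything else is copied through as its own UTF-8 bytes
        // (two chars at a time when they form a surrogate pair).
        int length = char.IsSurrogatePair(input, i) ? 2 : 1;
        bytes.AddRange(Encoding.UTF8.GetBytes(input.Substring(i, length)));
        i += length;
    }
    return Encoding.UTF8.GetString(bytes.ToArray());
}

Console.WriteLine(DecodeOctalEscapes("test_\\303\\246\\303\\270\\303\\245.txt"));
// test_æøå.txt
```

One difference from the regex version: characters outside the 0-255 range that already appear unescaped in the input survive here, whereas encoding them through ISO-8859-1 would replace them with `?`.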
By "binary string", I mean a string consisting only of characters with code points 0-255. It therefore amounts to a poor man's byte[], where you retrieve the code point of the character at index i instead of a byte value in a byte[] at index i (this is what we did in JavaScript a few years ago). Because ISO-8859-1 maps exactly the first 256 Unicode code points to a single byte each, it's perfect for converting a "binary string" into a byte[].
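The one-to-one mapping can be checked with a quick round trip. This is a small sketch rather than part of the original answer. The variable names are made up for illustration, and it assumes code page 28591 is available, as it is in the code above.

```csharp
using System;
using System.Linq;
using System.Text;

Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1

// Every possible byte value, 0 through 255.
byte[] allBytes = Enumerable.Range(0, 256).Select(n => (byte)n).ToArray();

// Bytes -> "binary string": each byte becomes the char with the same code point.
string binaryString = latin1.GetString(allBytes);
bool codePointsMatch = binaryString.Select((c, i) => c == i).All(ok => ok);

// "Binary string" -> bytes: nothing is lost on the way back.
byte[] roundTripped = latin1.GetBytes(binaryString);
bool lossless = allBytes.SequenceEqual(roundTripped);

Console.WriteLine(codePointsMatch); // True
Console.WriteLine(lossless);        // True
```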