C# partial UTF-8 byte stream conversion

此生再无相见时 submitted on 2021-02-04 20:51:31

Question


I wrote the following simple test:

[Test]
public void TestUTF8()
{
    var c = "abc☰def";
    var b = Encoding.UTF8.GetBytes(c);

    Assert.That(b.Length, Is.EqualTo(9));
    // Assume you are reading a byte stream and have received only the first 5 bytes so far
    var p = Encoding.UTF8.GetChars(b, 0, 5);
    Trace.WriteLine(new string(p));
    Assert.That(p.Length, Is.EqualTo(3));
}

The Trace output is abc� and the last assert fails because p.Length is 4.

However, I want the Trace output to be abc and the last assert to pass. In reality I know the stream contains valid characters; when the last few bytes do not yet form a complete character, they should simply be left in place until more data arrives.

So how can I achieve this in C#?


Answer 1:


Encoding.GetChars isn't designed for bytes arriving from a stream, where a single character may span multiple buffer segments and decoding state has to be carried from one call to the next. For that you should use a Decoder obtained from Encoding.GetDecoder. Decoder.Convert is low-level: it gives you control over both the input and output buffers, but it is somewhat difficult to use. Decoder.GetChars is easier to use and still does the important work of storing state between calls. We can easily extend Peter Duniho's answer to an arbitrary buffer size:

using System;
using System.IO;
using System.Text;

public static class Program
{
    public static void Main(string[] args)
    {
        var c = "abc☰def";
        var b = Encoding.UTF8.GetBytes(c);
        // Read in 3-byte chunks to simulate data arriving piecemeal.
        var result = DecodeFromStream(new MemoryStream(b), Encoding.UTF8, 3);
        Console.WriteLine(result);
        Console.WriteLine(c == result);
    }

    private static string DecodeFromStream(Stream dataStream, Encoding encoding, int bufferSize)
    {
        // The decoder remembers a trailing incomplete byte sequence between calls.
        Decoder decoder = encoding.GetDecoder();
        StringBuilder sb = new StringBuilder();
        int inputByteCount;
        byte[] inputBuffer = new byte[bufferSize];
        // Sized for the worst case, so one GetChars call can never overflow it.
        char[] charBuffer = new char[encoding.GetMaxCharCount(inputBuffer.Length)];

        while ((inputByteCount = dataStream.Read(inputBuffer, 0, inputBuffer.Length)) > 0)
        {
            int readChars = decoder.GetChars(inputBuffer, 0, inputByteCount, charBuffer, 0);
            if (readChars > 0)
                sb.Append(charBuffer, 0, readChars);
        }
        return sb.ToString();
    }
}
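
To connect this back to the test in the question, here is a minimal sketch of the same 5-byte split decoded with a single Decoder. The class name PartialUtf8Demo and the helper DecodeChunks are hypothetical names chosen for illustration; only Encoding.GetDecoder and Decoder.GetChars come from the answer above. The character ☰ (U+2630) takes three bytes in UTF-8, so the first 5 bytes end in the middle of it.

```csharp
using System;
using System.Text;

public static class PartialUtf8Demo
{
    // Hypothetical helper: decodes data in two chunks, split at splitAt,
    // using one shared Decoder so state carries over between the calls.
    public static string[] DecodeChunks(byte[] data, int splitAt)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(data.Length)];

        // First chunk: a trailing incomplete sequence is buffered inside
        // the decoder instead of being emitted as U+FFFD.
        int n1 = decoder.GetChars(data, 0, splitAt, chars, 0);
        string first = new string(chars, 0, n1);

        // Second chunk: the buffered bytes are completed and decoded.
        int n2 = decoder.GetChars(data, splitAt, data.Length - splitAt, chars, 0);
        string second = new string(chars, 0, n2);

        return new[] { first, second };
    }

    public static void Main()
    {
        byte[] b = Encoding.UTF8.GetBytes("abc☰def");
        string[] parts = DecodeChunks(b, 5);

        Console.WriteLine(parts[0].Length);                    // 3
        Console.WriteLine(parts[0] + parts[1] == "abc☰def");   // True
    }
}
```

With this approach the first call yields only abc, which is the behaviour the question asks for, and the two leftover bytes stay inside the decoder until the next chunk arrives.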


Source: https://stackoverflow.com/questions/26900642/c-sharp-partial-utf-8-byte-stream-conversion
