Question
Usually we can get a string from a byte[] using something like
var result = Encoding.UTF8.GetString(bytes);
However, I am having this problem: my input is an IEnumerable<byte[]> bytes (the implementation can be any structure of my choice). It is not guaranteed that a character lies entirely within one byte[] (for example, a 2-byte UTF-8 character can have its first byte in bytes[1][length - 1] and its second byte in bytes[2][0]).
Is there any way to decode them without merging/copying all the arrays together? UTF-8 is the main focus, but it would be better if other encodings could be supported. If there is no other solution, I think implementing my own UTF-8 reading would be the way.
I plan to stream them using a MemoryStream, however Encoding cannot work on Stream, just byte[]. If merged together, the potential result array may be very large (up to 4GB in List<byte[]> already).
I am using .NET Standard 2.0. I wish I could use 2.1 (it is not released yet) with Span<byte[]>, which would be perfect for my case!
Answer 1:
The Encoding class can't deal with that directly, but the Decoder returned from Encoding.GetDecoder() can (indeed, that's its entire reason for existing). StreamReader uses a Decoder internally.
It's slightly fiddly to work with though, as it needs to populate a char[], rather than returning a string (Encoding.GetString() and StreamReader normally handle the business of populating the char[]).
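To make the stateful behaviour concrete, here is a minimal sketch, assuming a hypothetical SplitCharDemo class and hand-picked sample bytes that are not part of the original answer. It feeds the two bytes of "é" (0xC3 0xA9) to the same Decoder in two separate calls.

```csharp
using System;
using System.Text;

static class SplitCharDemo
{
    // Hypothetical demo method: decodes two buffers that split "é" between them.
    public static string[] DecodeInTwoSteps()
    {
        // "é" is U+00E9, encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] first = { 0x61, 0xC3 };   // 'a' plus the lead byte of "é"
        byte[] second = { 0xA9, 0x62 };  // the trailing byte of "é" plus 'b'

        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[8];

        // The lead byte 0xC3 is held back inside the decoder, so only "a" is produced.
        int n1 = decoder.GetChars(first, 0, first.Length, chars, 0);
        string part1 = new string(chars, 0, n1);

        // The same decoder completes "é" with the first byte of the next buffer.
        int n2 = decoder.GetChars(second, 0, second.Length, chars, 0);
        string part2 = new string(chars, 0, n2);

        return new[] { part1, part2 }; // "a" and "éb"
    }
}
```

Calling Encoding.UTF8.GetString() on each buffer separately would instead turn both halves of the split character into replacement characters, because no state is carried between the two calls.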
The problem with using a MemoryStream is that you're copying all of the bytes from one array to another, for no gain. If all of your buffers are the same length, you can do this:
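// Requires: using System.Diagnostics; using System.Text;
// bufferSize is the common length of every byte[] in bytes
var decoder = Encoding.UTF8.GetDecoder();
// +1 in case it includes a work-in-progress char from the previous buffer
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize) + 1];
foreach (var byteSegment in bytes)
{
    // The decoder keeps any incomplete trailing bytes and completes them with the next segment
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}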
If the buffers have different lengths:
// Requires: using System; using System.Diagnostics; using System.Text;
var decoder = Encoding.UTF8.GetDecoder();
char[] chars = Array.Empty<char>();
foreach (var byteSegment in bytes)
{
    // +1 in case it includes a work-in-progress char from the previous buffer
    int charsMinSize = Encoding.UTF8.GetMaxCharCount(byteSegment.Length) + 1;
    // Grow the shared char buffer only when the current segment needs more room
    if (chars.Length < charsMinSize)
        chars = new char[charsMinSize];
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}
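The loops above never flush the decoder, so an incomplete byte sequence at the very end of the input stays inside it unreported. A possible way to package the same idea, assuming a hypothetical helper named DecodeChunks (not part of the original answer), is an iterator that sizes the char[] with Decoder.GetCharCount() and flushes once after the last buffer:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class ChunkedDecoding
{
    // Hypothetical helper: lazily decodes a sequence of byte buffers,
    // yielding one string per buffer without merging the buffers.
    public static IEnumerable<string> DecodeChunks(IEnumerable<byte[]> buffers, Encoding encoding)
    {
        Decoder decoder = encoding.GetDecoder();
        char[] chars = Array.Empty<char>();

        foreach (byte[] buffer in buffers)
        {
            // GetCharCount takes the decoder's pending bytes into account
            // without changing its state, so the size is exact.
            int needed = decoder.GetCharCount(buffer, 0, buffer.Length, flush: false);
            if (chars.Length < needed)
                chars = new char[needed];

            int count = decoder.GetChars(buffer, 0, buffer.Length, chars, 0, flush: false);
            if (count > 0)
                yield return new string(chars, 0, count);
        }

        // Flush once at the end: a truncated trailing sequence is handed to the
        // encoding's fallback (U+FFFD for the default Encoding.UTF8 instance).
        byte[] none = Array.Empty<byte>();
        char[] tail = new char[decoder.GetCharCount(none, 0, 0, flush: true)];
        int tailCount = decoder.GetChars(none, 0, 0, tail, 0, flush: true);
        if (tailCount > 0)
            yield return new string(tail, 0, tailCount);
    }
}
```

It could be consumed with something like foreach (var s in ChunkedDecoding.DecodeChunks(bytes, Encoding.UTF8)) Debug.WriteLine(s); and, because it only relies on Encoding.GetDecoder(), it should work for encodings other than UTF-8 as well.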
Answer 2:
"however Encoding cannot work on Stream, just byte[]."
Correct, but a StreamReader : TextReader can be linked to a Stream.
So just create that MemoryStream, push bytes in on one end and use ReadLine() on the other. I must say I have never tried that.
Answer 3:
Working code based on Henk's answer using StreamReader:
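// Requires: using System.Diagnostics; using System.IO; and an async method
using (var memoryStream = new MemoryStream())
{
    using (var reader = new StreamReader(memoryStream))
    {
        foreach (var byteSegment in bytes)
        {
            // Clear the previous segment so that a shorter segment leaves no stale bytes behind
            memoryStream.SetLength(0);
            await memoryStream.WriteAsync(byteSegment, 0, byteSegment.Length);
            // Rewind so the reader sees the segment that was just written
            memoryStream.Seek(0, SeekOrigin.Begin);
            Debug.WriteLine(await reader.ReadToEndAsync());
        }
    }
}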
Source: https://stackoverflow.com/questions/54970472/can-the-encoding-api-decode-a-stream-noncontinuous-bytes