Why are StreamReader and sr.BaseStream.Seek() giving junk characters even with UTF-8 encoding?

Question

The contents of the file abc.txt are:

ABCDEFGHIJ•XYZ

The character is shown correctly if I use this code (i.e. seek to position 9):

            string filePath = "D:\\abc.txt";
            FileStream fs = new FileStream(filePath, FileMode.Open);
            StreamReader sr = new StreamReader(fs, new UTF8Encoding(true), true);
            sr.BaseStream.Seek(9, SeekOrigin.Begin);
            char[] oneChar = new char[1];
            // Read returns the number of chars read, not the character itself,
            // so the result must not be cast to char.
            int charsRead = sr.Read(oneChar, 0, 1);
            MessageBox.Show(oneChar[0].ToString());

But if I seek to the position just after the special dot character, I get a junk character. That is, seeking to position 11 (just after the dot) produces junk:

            sr.BaseStream.Seek(11, SeekOrigin.Begin);

This should give 'X', because the character at position 11 is X.

I believe the file contents are valid UTF-8.

One more thing: the length of the StreamReader's BaseStream and the length of the StreamReader's contents are different.

   MessageBox.Show(sr.BaseStream.Length.ToString());
   MessageBox.Show(sr.ReadToEnd().Length.ToString());

Answer 1:

Why are StreamReader and sr.BaseStream.Seek() giving junk characters even with UTF-8 encoding?

It is exactly because of UTF-8 that sr.BaseStream is giving junk characters. :)

StreamReader is the "smarter" of the two: it decodes bytes into characters using an encoding, whereas FileStream (i.e. sr.BaseStream) does not. FileStream only knows about bytes, so Seek() takes a byte offset, not a character index.

Since your file is encoded in UTF-8 (a variable-length encoding), letters like A, B and C are encoded with 1 byte, but the • character needs 3 bytes. You can check how many bytes a character needs with:

            Console.WriteLine(Encoding.UTF8.GetByteCount("•")); // prints 3

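So when you move the stream to "the position just after •", you haven't actually moved past the •; you have landed on its second byte. Assuming the file has no byte order mark, A to J occupy bytes 0 to 9, • occupies bytes 10 to 12, and X starts at byte 13.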
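One way to land on a character boundary is to convert the character index into a byte offset before seeking. The following is only a minimal sketch under stated assumptions: the file is UTF-8 without a byte order mark, the text before the target is known in advance, and a MemoryStream stands in for the file. The names Utf8SeekSketch, ByteOffsetOfChar and ReadCharAt are hypothetical helpers made up for this illustration, not part of any library.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8SeekSketch
{
    // Hypothetical helper: byte offset at which the character with the given
    // index starts, assuming UTF-8 without a byte order mark.
    public static long ByteOffsetOfChar(string text, int charIndex)
    {
        return Encoding.UTF8.GetByteCount(text.Substring(0, charIndex));
    }

    // Hypothetical helper: seeks to a byte offset and decodes one character.
    public static char ReadCharAt(Stream stream, long byteOffset)
    {
        // leaveOpen: true so that disposing the reader does not close the stream.
        using (var reader = new StreamReader(stream, new UTF8Encoding(false), false, 1024, true))
        {
            stream.Seek(byteOffset, SeekOrigin.Begin);
            // Drop anything the reader has buffered so it decodes from the new position.
            reader.DiscardBufferedData();
            return (char)reader.Read();
        }
    }

    public static void Main()
    {
        // "\u2022" is the bullet character from the question.
        string text = "ABCDEFGHIJ\u2022XYZ";
        byte[] bytes = new UTF8Encoding(false).GetBytes(text);

        long offset = ByteOffsetOfChar(text, 11);
        Console.WriteLine(offset); // 13

        using (var stream = new MemoryStream(bytes))
        {
            Console.WriteLine(ReadCharAt(stream, offset)); // X
        }
    }
}
```

The character index here counts UTF-16 code units, so this sketch would need extra care for text containing surrogate pairs. If the content before the target is not known, reading characters sequentially through the StreamReader is probably the simpler option.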

The lengths differ for a similar reason: sr.ReadToEnd().Length gives you the number of characters, whereas sr.BaseStream.Length gives you the number of bytes.
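The difference can be reproduced without a file. This is a small sketch that assumes the file holds exactly the text from the question with no byte order mark and no trailing newline; LengthSketch and Utf8ByteCount are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

static class LengthSketch
{
    // Hypothetical helper: number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        // "\u2022" is the bullet character from the question.
        string text = "ABCDEFGHIJ\u2022XYZ";

        Console.WriteLine(text.Length);         // 14 characters
        Console.WriteLine(Utf8ByteCount(text)); // 16 bytes (13 one-byte letters + 3 for the bullet)
    }
}
```

If the file starts with a UTF-8 byte order mark, the stream length is 3 bytes larger still.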

Source: https://stackoverflow.com/questions/60202410/why-is-streamreader-and-sr-basestream-seek-giving-junk-characters-even-in-utf8

Tags

utf-8

streamreader