C# Issue with reading XML with chars of different encodings in it

刺人心 2021-01-27 05:26

I faced a problem with reading the XML. A solution was found, but some questions remain. The incorrect XML file is encoded in UTF-8 and has the appropriate mark in its ...

1 answer
  •  [愿得一人]
    2021-01-27 06:20

    The first thing to note is that the XML file is in fact flawed: mixing text encodings in the same file like this should not be done. The error is even more obvious when the file explicitly declares its encoding.

    As for why it can be read without an exception using StreamReader: an Encoding instance carries settings that control what happens when incompatible data is encountered.

    Encoding.UTF8 is documented to use fallback characters. From http://msdn.microsoft.com/en-us/library/system.text.encoding.utf8.aspx:

    The UTF8Encoding object that is returned by this property may not have the appropriate behavior for your application. It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.
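
    To see this fallback in action, here is a minimal sketch. The class name FallbackDemo and the sample bytes are made up for illustration; the only library call is Encoding.UTF8.GetString. Note that although the quoted text mentions "?", on the runtimes I am aware of the substituted character when decoding is U+FFFD, so the sketch does not depend on which replacement character appears.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original question or answer.
public static class FallbackDemo
{
    // "a", then a lone 0xFF byte (never valid in UTF-8), then "b".
    public static readonly byte[] Sample = { 0x61, 0xFF, 0x62 };

    // Decodes with the shared Encoding.UTF8 instance. Invalid bytes are
    // silently swapped for a replacement character; no exception is thrown.
    public static string DecodeLenient(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }
}
```

    Calling FallbackDemo.DecodeLenient(FallbackDemo.Sample) returns a string that still starts with "a" and ends with "b"; the bad byte in the middle is replaced rather than reported, which is why a StreamReader built on Encoding.UTF8 reads the file without complaint.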

    You can instantiate the encoding yourself to get different settings. This is most probably what XDocument.Load() does, as it would generally be bad to hide errors by default. http://msdn.microsoft.com/en-us/library/system.text.utf8encoding.aspx
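
    As a rough illustration of "instantiating the encoding yourself", the sketch below builds a UTF8Encoding with throwOnInvalidBytes set to true and hands it to a StreamReader. The names StrictUtf8Demo and TryReadStrict are invented for this example, and whether XDocument.Load() does exactly this internally is an assumption, as the answer itself says.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original question or answer.
public static class StrictUtf8Demo
{
    // Second argument (throwOnInvalidBytes: true) makes decoding fail
    // loudly instead of substituting a replacement character.
    public static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Returns true and the decoded text if every byte is valid UTF-8;
    // returns false if the decoder hits an invalid sequence.
    public static bool TryReadStrict(byte[] data, out string text)
    {
        try
        {
            using (var reader = new StreamReader(new MemoryStream(data), StrictUtf8))
            {
                text = reader.ReadToEnd();
                return true;
            }
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }
}
```

    With this encoding, a StreamReader reports the bad bytes instead of hiding them, so both reading paths fail in the same visible way.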

    If you are being sent such broken XML files, step 1 is to complain (loudly) about it. There is no valid reason for such behavior. If you then absolutely must process them anyway, I suggest having a look at the UTF8Encoding class and its DecoderFallback property. It seems you should be able to implement a custom DecoderFallback and DecoderFallbackBuffer to add logic that will understand the UTF-16 byte sequence.
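
    The shape of such a custom fallback might look like the sketch below. It is only a skeleton under stated assumptions: the class names are invented, and the mapping is a placeholder that reinterprets each rejected byte as the character with the same numeric value (a Latin-1 reading). Logic that recognises the UTF-16 byte sequences from the broken file would go in Fallback() in place of that placeholder.

```csharp
using System;
using System.Text;

// Hypothetical fallback: each byte the UTF-8 decoder rejects is emitted as
// the char with the same numeric value. This is a placeholder mapping, not
// the UTF-16 handling the real file would need.
public sealed class Latin1DecoderFallback : DecoderFallback
{
    // Assumed upper bound on chars produced per fallback call.
    public override int MaxCharCount { get { return 4; } }

    public override DecoderFallbackBuffer CreateFallbackBuffer()
    {
        return new Latin1Buffer();
    }

    private sealed class Latin1Buffer : DecoderFallbackBuffer
    {
        private char[] _pending = new char[0];
        private int _next;

        // Called by the decoder with the bytes it could not decode.
        public override bool Fallback(byte[] bytesUnknown, int index)
        {
            _pending = new char[bytesUnknown.Length];
            for (int i = 0; i < bytesUnknown.Length; i++)
                _pending[i] = (char)bytesUnknown[i];
            _next = 0;
            return _pending.Length > 0;
        }

        // Hands the substitute chars back one at a time; '\0' means "done".
        public override char GetNextChar()
        {
            return _next < _pending.Length ? _pending[_next++] : '\0';
        }

        public override bool MovePrevious()
        {
            if (_next == 0) return false;
            _next--;
            return true;
        }

        public override int Remaining { get { return _pending.Length - _next; } }

        public override void Reset()
        {
            _pending = new char[0];
            _next = 0;
        }
    }
}

public static class TolerantUtf8
{
    // UTF-8 encoding whose decoder uses the custom fallback above.
    public static Encoding Create()
    {
        return Encoding.GetEncoding("utf-8",
            EncoderFallback.ExceptionFallback, new Latin1DecoderFallback());
    }
}
```

    The resulting encoding can be passed to a StreamReader like any other, for example new StreamReader(stream, TolerantUtf8.Create()). Keeping the repair logic inside the fallback means valid UTF-8 is still decoded by the normal decoder and only the rejected bytes take the custom path.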
