Question
I have an XmlDocument that includes Kanji in its text content, and I need to write it to a stream using ISO-8859-1 encoding. When I do, none of the Kanji characters are encoded properly; they are replaced with "?" characters instead.
Here is sample code that demonstrates how the XML is written from the XmlDocument:
MemoryStream mStream = new MemoryStream();
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
XmlTextWriter writer = new XmlTextWriter(mStream, enc);
doc.WriteTo(writer);
writer.Flush();
mStream.Flush();
mStream.Position = 0;
StreamReader sReader = new StreamReader(mStream, enc);
string formattedXML = sReader.ReadToEnd();
What can be done to correctly encode Kanji in this specific situation?
Answer 1:
As mentioned in the comments, the ? character shows up because Kanji characters are not supported by ISO-8859-1, so the encoder substitutes ? as a fallback character. Encoding fallbacks are discussed in the Remarks section of the documentation for Encoding:
Note that the encoding classes allow errors (unsupported characters) to:
- Silently change to a "?" character.
- Use a "best fit" character.
- Change to an application-specific behavior through use of the EncoderFallback and DecoderFallback classes with the U+FFFD Unicode replacement character.
This is the behavior you are seeing.
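The replacement fallback can be observed directly, without any XML involved. A minimal sketch:

```csharp
using System;
using System.Text;

class FallbackDemo
{
    static void Main()
    {
        // ISO-8859-1 (Latin-1) cannot represent Kanji, so the encoder's
        // default replacement fallback substitutes '?' for each such character.
        Encoding enc = Encoding.GetEncoding("ISO-8859-1");
        byte[] bytes = enc.GetBytes("畑 hatake");
        Console.WriteLine(enc.GetString(bytes)); // prints: ? hatake
    }
}
```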
However, even though Kanji characters are not supported by ISO-8859-1, you can get a much better result by switching to the newer XmlWriter returned by XmlWriter.Create(Stream, XmlWriterSettings) and setting your encoding via XmlWriterSettings.Encoding, like so:
MemoryStream mStream = new MemoryStream();
var enc = Encoding.GetEncoding("ISO-8859-1");
var settings = new XmlWriterSettings
{
    Encoding = enc,
    CloseOutput = false,
    // Remove this line to emit the XML declaration. Unlike XmlTextWriter,
    // XmlWriter.Create() includes it automatically unless it is omitted here.
    OmitXmlDeclaration = true,
};
using (var writer = XmlWriter.Create(mStream, settings))
{
    doc.WriteTo(writer);
}
mStream.Position = 0;
var sReader = new StreamReader(mStream, enc);
var formattedXML = sReader.ReadToEnd();
By setting the Encoding property of XmlWriterSettings, the XML writer is made aware whenever a character is not supported by the current encoding, and automatically replaces it with an XML character entity reference rather than a hardcoded fallback character.
E.g. say you have XML like the following:
<Root>
<string>畑 はたけ hatake "field of crops"</string>
</Root>
Then your code will output the following, mapping all Kanji to the single fallback character:
<Root><string>? ??? hatake "field of crops"</string></Root>
Whereas the new version will output:
<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake "field of crops"</string></Root>
Notice that the Kanji characters have been replaced with character entity references such as &#x7551;. All compliant XML parsers will recognize and reconstruct those characters, so no information is lost even though your preferred encoding cannot represent Kanji.
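The round trip can be verified with a minimal sketch: load the entity-escaped output back into an XmlDocument and the parser expands the references to the original characters.

```csharp
using System;
using System.Xml;

class RoundTripDemo
{
    static void Main()
    {
        // Numeric character references survive the ISO-8859-1 round trip;
        // any compliant parser expands them back to the original characters.
        string escaped =
            "<Root><string>&#x7551; &#x306F;&#x305F;&#x3051; hatake</string></Root>";
        var doc = new XmlDocument();
        doc.LoadXml(escaped);
        Console.WriteLine(doc.DocumentElement.InnerText); // prints: 畑 はたけ hatake
    }
}
```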
Finally, as an aside, note that the documentation for XmlTextWriter states:
Starting with the .NET Framework 2.0, we recommend that you use the System.Xml.XmlWriter class instead.
So replacing it with an XmlWriter is a good idea in general.
A sample .NET fiddle demonstrates usage of both writers and asserts that the XML generated by XmlWriter is semantically equivalent to the original XML despite the escaping of characters.
Source: https://stackoverflow.com/questions/48402686/xmldocument-with-kanji-text-content-is-not-encoded-correctly-to-iso-8859-1-using