Producing valid XML with Java and UTF-8 encoding

无人共我 2020-12-02 19:04

I am using JAXP to generate and parse an XML document from which some fields are loaded from a database.

Code to serialize the XML:

DocumentBuilder b         


        
2 answers
  • 2020-12-02 19:40

    Well, for sure 0xFC and 0xF6 are not valid UTF-8 bytes. They are the ISO-8859-1 codes for ü and ö, which in UTF-8 should have been encoded as the two-byte sequences 0xC3 0xBC and 0xC3 0xB6.

    Most likely the problem is that the original source of the characters is declared as UTF-8 when it is actually in some other encoding.
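
    To check which bytes a given encoding really produces, a small sketch like the one below can help. The class name Utf8Bytes and the helper hex are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question's code.
class Utf8Bytes {
    // Render a byte array as space-separated upper-case hex, e.g. "C3 BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00FC\u00F6"; // the two characters u-umlaut and o-umlaut
        // One byte per character in ISO-8859-1: FC F6
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1)));
        // Two bytes per character in UTF-8: C3 BC C3 B6
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

    If the file on disk contains a lone FC or F6 byte, it was written with a single-byte encoding, whatever its XML declaration claims.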

  • 2020-12-02 19:48

    Use a FileOutputStream rather than a FileWriter.

    The latter applies the platform's default encoding, which is almost certainly not UTF-8 (depending on your platform, it's probably Windows-1252 or ISO-8859-1).
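
    Since the serialization code in the question is cut off, here is only a rough sketch of what writing through a FileOutputStream could look like. The class WriteXmlUtf8, the sample element name and the temp file are assumptions for illustration; the serializer is the standard javax.xml.transform identity Transformer.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class, not the original poster's code.
class WriteXmlUtf8 {
    // Build a tiny document with one element holding the given text.
    static Document sampleDocument(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("name");
        root.setTextContent(text);
        doc.appendChild(root);
        return doc;
    }

    // Hand the serializer a byte stream so that it, not a Writer,
    // decides how characters become bytes.
    static void write(Document doc, File file) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        try (OutputStream out = new FileOutputStream(file)) {
            t.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("sample", ".xml");
        write(sampleDocument("M\u00FCller"), file);
        System.out.println(Files.readAllBytes(file.toPath()).length + " bytes written");
    }
}
```

    The point of the design is that the encoding named in the XML declaration and the encoding of the bytes on disk come from the same place, so they cannot disagree.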

    Edit (now that I have some time):

    An XML document without a prologue is permitted to be encoded as UTF-8 or UTF-16. With a prologue, it is allowed to specify its own encoding (the prologue can contain only US-ASCII characters, so the prologue is always readable).

    A Reader deals with characters; it will decode the byte stream of the underlying InputStream. As a result, when you pass a Reader to the parser, you are telling it that you've already handled the encoding, so the parser will ignore the prologue. When you pass an InputStream (which reads bytes), it does not make this assumption, and will look to the prologue to define the encoding -- or default to UTF-8/UTF-16 if it's not there.
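
    The difference can be seen with a small experiment, sketched below under a few assumptions: the class ParseStreamVsReader and its methods are invented for this illustration, and the document is deliberately encoded as ISO-8859-1 with a matching declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical example class, for illustration only.
class ParseStreamVsReader {
    // A document that is declared as, and really encoded in, ISO-8859-1.
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>M\u00FCller</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Bytes go to the parser, which reads the declaration itself.
    static String parseFromStream(byte[] bytes) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return b.parse(new ByteArrayInputStream(bytes))
                .getDocumentElement().getTextContent();
    }

    // The Reader decodes first (here with the wrong charset), so the
    // parser only ever sees already-decoded characters.
    static String parseFromReader(byte[] bytes) throws Exception {
        DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        return b.parse(new InputSource(r))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = latin1Document();
        System.out.println(parseFromStream(doc)); // text survives intact
        System.out.println(parseFromReader(doc)); // the umlaut is lost
    }
}
```

    In the Reader case the damage is done before the parser starts, which is why the declaration cannot rescue the text.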

    I've never tried reading a file that is encoded in UTF-16. I suspect that the parser will look for a Byte Order Mark (BOM) as the first 2 bytes of the file.
