Question
I'm using Java 6. I have an XML template, which begins like so:
<?xml version="1.0" encoding="UTF-8"?>
However, I notice when I parse and output it with the following code (using Apache Commons-io 2.4) …
Document doc = null;
InputStream in = this.getClass().getClassLoader().getResourceAsStream("my-template.xml");
try
{
    byte[] data = org.apache.commons.io.IOUtils.toByteArray( in );
    InputSource src = new InputSource(new StringReader(new String(data)));
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    doc = builder.parse(src);
}
finally
{
    in.close();
}
The first line is output as
<?xml version="1.0" encoding="UTF-16"?>
What do I need to do when parsing/outputting the file so that the header encoding will remain “UTF-8”?
Edit: Per the suggestion given, I changed my code to
Document doc = null;
InputStream in = this.getClass().getClassLoader().getResourceAsStream(name);
try
{
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    doc = builder.parse(in);
}
finally
{
    in.close();
}
But even though the first line of my input template file is
<?xml version="1.0" encoding="UTF-8"?>
when I output the document as a String, it produces
<?xml version="1.0" encoding="UTF-16"?>
as a first line. Here's what I use to output the "doc" object as a string ...
private String getDocumentString(Document doc)
{
    DOMImplementationLS domImplementation = (DOMImplementationLS)doc.getImplementation();
    LSSerializer lsSerializer = domImplementation.createLSSerializer();
    return lsSerializer.writeToString(doc);
}
Answer 1:
new StringReader(new String(data))
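This is wrong: new String(data) decodes the bytes with the platform default charset, and handing the parser a Reader means it never sees the raw bytes, so it cannot apply the encoding named in the XML declaration. You should let the parser detect the document encoding by using (for example) DocumentBuilder.parse(InputStream):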
doc = builder.parse(in);
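As a minimal sketch of why the byte stream matters, the following parses a small document that declares ISO-8859-1 and contains one non-ASCII character (é). The class name DeclaredEncodingParse and the SAMPLE string are hypothetical, made up for this illustration; only the standard JAXP calls come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncodingParse {
    // Hypothetical sample document; the escape in the literal is e-acute.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><foo>caf\u00e9</foo>";

    // Encodes the sample the way a file on disk would be stored.
    public static byte[] sampleBytes() {
        return SAMPLE.getBytes(Charset.forName("ISO-8859-1"));
    }

    // Hands raw bytes to the parser, which reads the XML declaration
    // and decodes the rest of the stream accordingly.
    public static String parseText(byte[] xmlBytes) {
        try {
            InputStream in = new ByteArrayInputStream(xmlBytes);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseText(sampleBytes()));
    }
}
```

Had the bytes been turned into a String with the wrong charset first, the parser would have received already-damaged characters and could not have repaired them.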
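What encoding the DOM is serialized to depends on how you write it. The in-memory DOM holds decoded characters, not bytes, so it has no concept of encoding. LSSerializer.writeToString returns a String, which is UTF-16 by definition, and that is why it emits a UTF-16 declaration regardless of what the input file declared.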
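The effect can be reproduced without any input file. This sketch builds a one-element document in memory and serializes it the same way as the question's getDocumentString; the class and method names (WriteToStringDemo, newFooDocument, viaWriteToString) are hypothetical. With the JDK's bundled DOM implementation, the printed declaration should name UTF-16.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class WriteToStringDemo {
    // Builds a tiny in-memory document consisting of a single <foo/> element.
    public static Document newFooDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            doc.appendChild(doc.createElement("foo"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Same call sequence as the question's getDocumentString.
    public static String viaWriteToString(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) {
        // No file was read, yet the declaration still names an encoding.
        System.out.println(viaWriteToString(newFooDocument()));
    }
}
```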
Writing the document to a string with a UTF-8 declaration:
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.*;

public class DomIO {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        doc.appendChild(doc.createElement("foo"));
        System.out.println(getDocumentString(doc));
    }

    public static String getDocumentString(Document doc) {
        DOMImplementationLS domImplementation = (DOMImplementationLS)
                doc.getImplementation();
        LSSerializer lsSerializer = domImplementation.createLSSerializer();
        LSOutput lsOut = domImplementation.createLSOutput();
        lsOut.setEncoding("UTF-8");
        lsOut.setCharacterStream(new StringWriter());
        lsSerializer.write(doc, lsOut);
        return lsOut.getCharacterStream().toString();
    }
}
LSOutput also supports a binary stream (setByteStream) if you want the serializer itself to encode the document into bytes that match the declared encoding.
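A minimal sketch of the byte-stream variant follows, assuming the output should be UTF-8. The class name DomToBytes and its helper methods are hypothetical; the LSOutput calls (setEncoding, setByteStream) are part of the standard DOM Load and Save API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

public class DomToBytes {
    // Hypothetical sample: <foo> with one non-ASCII character in its text.
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("foo");
            root.setTextContent("caf\u00e9");
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lets the serializer do the character-to-byte encoding itself,
    // so the declaration and the actual bytes are produced together.
    public static byte[] toUtf8Bytes(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        LSOutput out = ls.createLSOutput();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        out.setEncoding("UTF-8");
        out.setByteStream(buffer);
        serializer.write(doc, out);
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8Bytes(sampleDocument());
        // Decoded here only for display; the bytes could go straight to a file.
        System.out.println(new String(bytes, Charset.forName("UTF-8")));
    }
}
```

Writing bytes rather than a String avoids a second, separate encoding step in which the declared and actual encodings could drift apart.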
Answer 2:
Turns out that when I changed my Document -> String method to
private String getDocumentString(Document doc)
{
    String ret = null;
    DOMSource domSource = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer;
    try
    {
        transformer = tf.newTransformer();
        transformer.transform(domSource, result);
        ret = writer.toString();
    }
    catch (TransformerConfigurationException e)
    {
        e.printStackTrace();
    }
    catch (TransformerException e)
    {
        e.printStackTrace();
    }
    return ret;
}
the header was output with 'encoding="UTF-8"' instead of 'encoding="UTF-16"'.
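The method above relies on the transformer's default output encoding. As a hedged variant, the encoding can be requested explicitly through OutputKeys.ENCODING, which removes the dependence on whatever default a given implementation picks. The class name TransformerToString and its helpers are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class TransformerToString {
    // Builds a tiny in-memory document consisting of a single <foo/> element.
    public static Document newFooDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            doc.appendChild(doc.createElement("foo"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Identity transform from DOM to text, with the encoding named explicitly.
    public static String toXmlString(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXmlString(newFooDocument()));
    }
}
```

Unlike the version that prints the stack trace and returns null, this sketch rethrows the failure, so a caller cannot mistake a failed transform for an empty result.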
Source: https://stackoverflow.com/questions/28546634/how-does-apache-commons-io-convert-my-xml-header-from-utf-8-to-utf-16