Question
I'm reading an XML file with dom4j. The file looks like this:
...
<Field> hello, world...</Field>
...
I read the file with SAXReader into a Document. When I use getText() on the node, I obtain the following String:
\r\n hello, world...
I do some processing and then write another file using asXml(). But the characters are not escaped as they were in the original file, which causes an error in the external system that consumes the file.
How can I escape the special characters, so that they appear as they did in the original file, when writing the file?
Answer 1:
You cannot, at least not easily. Those aren't 'escapes', they are 'character entities'. They are a fundamental part of XML. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the kind that are defined in a DTD.
Answer 2:
It depends on what you're getting and what you want (see my previous comment.)
The SAX reader is doing nothing wrong: your XML is giving you a literal newline character. If you control this XML, then instead of the newline characters you will need to insert a \ (backslash) character followed by the "r" or "n" character (or both).
If you do not control this XML, then you will need to do a literal conversion of the newline character to "\r\n" after you've gotten your string back. In C# it would be something like:
myString = myString.Replace("\r\n", "\\r\\n");
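Since the question uses dom4j, a Java version of the same conversion may be more directly useful. This is a minimal sketch; the class and method names (NewlineLiteral, toLiteral) are made up for illustration. It handles \r and \n separately so that a lone \n or \r is converted too.

```java
// Hypothetical helper: turns real CR/LF characters into the two-character
// sequences backslash-r and backslash-n.
final class NewlineLiteral {
    static String toLiteral(String s) {
        // String.replace works on literal text, not regular expressions.
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }
}
```

For example, NewlineLiteral.toLiteral(node.getText()) would be applied to the String returned by dom4j before further processing.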
Answer 3:
XML entities are abstracted away in DOM. Content is exposed as a String without the need to bother about the encoding, which in most cases is what you want.
But SAX has some support for controlling how entities are processed. You could try to create an XMLReader with a custom EntityResolver#resolveEntity, and pass it as a parameter to the SAXReader. But I fear it may not work:
The Parser will call this method before opening any external entity except the top-level document entity (including the external DTD subset, external entities referenced within the DTD, and external entities referenced within the document element)
Otherwise you could try to configure a LexicalHandler for SAX so that you are notified when an entity is encountered. The Javadoc for LexicalHandler#startEntity says:
Report the beginning of some internal and external XML entities.
You will not be able to change the resolving, but that may still help.
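To check what a parser actually reports, a small probe along these lines may help. It is a sketch using only the SAX classes bundled with the JDK rather than dom4j, and the class name EntityProbe and the sample input with numeric character references are assumptions made for illustration. It records the text passed to characters() and any names passed to startEntity(). With the JDK's default parser I would expect the references to arrive already resolved as text, so do not count on startEntity() firing for them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical probe: collects character data and reported entity names.
final class EntityProbe {
    final StringBuilder text = new StringBuilder();
    final List<String> entities = new ArrayList<>();

    static EntityProbe parse(String xml) throws Exception {
        EntityProbe probe = new EntityProbe();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void characters(char[] ch, int start, int length) {
                probe.text.append(ch, start, length);
            }

            @Override
            public void startEntity(String name) {
                probe.entities.add(name);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Standard SAX2 property id for registering a LexicalHandler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return probe;
    }
}
```

Calling EntityProbe.parse("<Field>&#13;&#10; hello, world...</Field>") and inspecting probe.text and probe.entities shows what the handler saw for that input.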
EDIT
You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See "reading a XML file" and "writing an XML file" in the dom4j documentation. Don't call asXml() and dump the resulting string to the file yourself.
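import java.io.FileOutputStream;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;
// Serialize the dom4j Document (doc) through XMLWriter instead of asXml()
FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);
writer.close(); // flushes and closes the underlying stream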
Answer 4:
You can pre-process the input stream to replace & with e.g. [$AMPERSAND_CHARACTER$], then do the processing with dom4j, and post-process the output stream to make the reverse substitution.
Example (using streamflyer):
import com.github.rwitzel.streamflyer.util.ModifyingReaderFactory;
import com.github.rwitzel.streamflyer.util.ModifyingWriterFactory;
// Pre-process
Reader originalReader = new InputStreamReader(myInputStream, "utf-8");
Reader modifyingReader = new ModifyingReaderFactory().createRegexModifyingReader(originalReader, "&", "[\\$AMPERSAND_CHARACTER\\$]");
// Read and modify XML via dom4j
SAXReader xmlReader = new SAXReader();
Document xmlDocument = xmlReader.read(modifyingReader);
// ...
// Post-process
Writer originalWriter = new OutputStreamWriter(myOutputStream, "utf-8");
Writer modifyingWriter = new ModifyingWriterFactory().createRegexModifyingWriter(originalWriter, "\\[\\$AMPERSAND_CHARACTER\\$\\]", "&");
// Write to output stream
OutputFormat xmlOutputFormat = OutputFormat.createPrettyPrint();
XMLWriter xmlWriter = new XMLWriter(modifyingWriter, xmlOutputFormat);
xmlWriter.write(xmlDocument);
xmlWriter.close();
You can also use FilterInputStream/FilterOutputStream, PipedInputStream/PipedOutputStream, or ProxyInputStream/ProxyOutputStream for pre- and post-processing.
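For documents small enough to hold in memory, the same placeholder idea can be tried without streamflyer or dom4j. The following is only a sketch under that assumption: it uses String.replace plus the JDK's DOM parser and Transformer, and the class name AmpersandRoundTrip is made up for illustration. Note that every & is masked, so predefined entities such as &amp;lt; also show up as placeholder text while the document is being processed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical in-memory variant of the pre-process / post-process approach.
final class AmpersandRoundTrip {
    static final String PLACEHOLDER = "[$AMPERSAND_CHARACTER$]";

    static String roundTrip(String xml) throws Exception {
        // Pre-process: hide every '&' so the parser never resolves a reference.
        String masked = xml.replace("&", PLACEHOLDER);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(masked)));

        // ... modify doc here ...

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));

        // Post-process: restore the original ampersands.
        return out.toString().replace(PLACEHOLDER, "&");
    }
}
```

The streaming version above remains the better fit for large files, since this sketch reads the whole document into a String.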
Source: https://stackoverflow.com/questions/2251963/saxreader-not-re-ecape-characters