XML parsing with SAX | how to handle special characters?

Posted by 佐手、 on 2020-01-02 08:14:29

Question


We have a Java application that pulls data from SAP, parses it, and renders it to the users. The data is pulled using the JCo connector.

Recently we were thrown an exception:

org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.

So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.

My questions are:

  1. Is there an existing (open source) utility that replaces illegal characters in XML?
  2. If I have to write such a utility myself, how should I handle those characters?
  3. Why is the above exception thrown?
  3. Why is the above exception thrown?

Thank You.


Answer 1:


From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but is not.

While replacing '&' with '&amp;' can be done by a simple String.replaceAll(...) on the string returned by the toXML() call, other characters can be harder to replace ('<' and '>', for example).

regards Guillaume




Answer 2:


It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.

Alternatively, looking at the character reference, &#00, you might be able to get away with a replace-all that swaps it for the empty string:

String goodXml = badXml.replaceAll("&#00;", "");
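
If other invalid references besides &#00; can show up, the same idea can be generalized. The following is only a sketch, assuming the valid range is the Char production of XML 1.0; the class and method names (InvalidCharRefStripper, stripInvalidCharRefs) are made up for illustration and are not part of JCo or any library. It removes decimal and hexadecimal numeric character references whose code point is not a legal XML 1.0 character and leaves all other references untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class InvalidCharRefStripper {
    // Matches decimal (&#0;) and hexadecimal (&#x0;) numeric character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every numeric character reference that points at an invalid character.
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
            } catch (NumberFormatException e) {
                cp = -1; // too many digits to be a code point, treat as invalid
            }
            String replacement = isValidXml10(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Note that this works on the raw text, so it would also remove such sequences inside CDATA sections, where they are literal text rather than references. Whether that matters depends on the data SAP sends.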



Answer 3:


I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.

If I were in your situation, I'd either come up with a bespoke encoding that replaces the characters which are invalid in XML and handle them as special cases in your parsing, or, if possible, replace them with whitespace.
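
The whitespace option could look roughly like the sketch below. It assumes the invalid characters are present as raw characters in a Java String (not as numeric references) and that losing them is acceptable; the class and method names are hypothetical. Every code point outside the XML 1.0 Char range, including unpaired surrogates, is replaced with a single space.

```java
// Hypothetical helper, not part of any library.
final class InvalidXmlCharReplacer {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces each invalid code point with one space; valid text is copied as-is.
    static String replaceInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs as one unit
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
        }
        return out.toString();
    }
}
```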

I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.




Answer 4:


You can encode/decode non-ASCII characters in XML by using the escapeXml and unescapeXml methods of the Apache Commons Lang class StringEscapeUtils. See:

http://commons.apache.org/lang/api-2.4/index.html

To read about how XML character references work, search for "numeric character references" on Wikipedia.
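
As a small illustration of numeric character references, here is a sketch using only the JDK's built-in SAX parser. The class name WellFormednessCheck is made up for this example. A reference such as &#65; resolves to a legal character and parses fine, while &#00; refers to a character that XML 1.0 does not allow at all, so the parser rejects the document with a SAXParseException like the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for demonstration purposes.
final class WellFormednessCheck {
    // Returns true if the SAX parser accepts the text, false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Includes SAXParseException, e.g. for an invalid character reference.
            return false;
        } catch (Exception e) {
            // ParserConfigurationException or IOException: a setup problem, not bad XML.
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, isWellFormed("<a>&#65;</a>") should return true and isWellFormed("<a>&#00;</a>") should return false.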



Source: https://stackoverflow.com/questions/2467830/xml-parsing-with-sax-how-to-handle-special-characters
