I am trying to parse an XML file which contains some special characters like "&" using a DOM parser. I am getting the SAXParseException "The reference to entity must end with the ';' delimiter".
To complement @PSpeed's answer, here is a complete solution (SAX parser):
try {
    InputStream xmlStreamToParse = blob.getBinaryStream();

    // Clean: escape stray ampersands before the parser sees them
    BufferedReader br = new BufferedReader(new InputStreamReader(xmlStreamToParse));
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        // or whatever you want to clean
        sb.append(line.replaceAll("&([^;]+(?!(?:\\w|;)))", "&amp;$1"));
        sb.append('\n'); // readLine() strips the line terminator, so put one back
    }
    InputStream stream = org.apache.commons.io.IOUtils.toInputStream(sb.toString(), "UTF-8");

    // Parsing
    SAXParserFactory saxFactory = SAXParserFactory.newInstance();
    saxFactory.setNamespaceAware(true);
    SAXParser theParser = saxFactory.newSAXParser();
    XMLReader xmlReader = theParser.getXMLReader();
    LicenceXMLHandler licence = new LicenceXMLHandler();
    xmlReader.setContentHandler(licence);
    xmlReader.parse(new InputSource(stream));
} catch (SQLException | SAXException | IOException | ParserConfigurationException e) {
    log.error("Error: " + e);
}
Explanations:
Simply replace each & with &amp; and it will work.
Your input is invalid XML. Specifically, you cannot have an '&' character in an attribute value unless it is part of a well-formed character entity reference.
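To make the point above concrete, here is a minimal sketch using the JDK's built-in DOM parser. The class name, the `parseHref` helper and the sample `<link>` element are made up for illustration; they are not from the question. It shows that a bare '&' in an attribute value is rejected, while the same value written with &amp; parses and comes back unescaped.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXParseException;

public class AmpersandInAttributeDemo {

    // Hypothetical helper: parses the XML with the default DOM parser and
    // returns the "href" attribute of the root element.
    static String parseHref(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getAttribute("href");
    }

    public static void main(String[] args) throws Exception {
        String bad = "<link href=\"page.php?a=1&b=2\"/>";      // bare ampersand
        String good = "<link href=\"page.php?a=1&amp;b=2\"/>"; // escaped ampersand

        try {
            parseHref(bad);
        } catch (SAXParseException e) {
            // The parser treats "&b" as the start of an entity reference
            // and fails because no ';' follows.
            System.out.println("Rejected: " + e.getMessage());
        }

        // The parser unescapes &amp; so the application sees a plain '&'.
        System.out.println(parseHref(good)); // prints page.php?a=1&b=2
    }
}
```

The escaping only exists in the serialized file; once parsed, the attribute value contains the plain '&' again, so nothing downstream needs to change.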
AFAIK, you have two choices:
Some of you might be familiar with the error “The reference to entity XX must end with the ‘;’ delimiter”, which shows up when adding or altering code in XML templates. I get that error myself sometimes when I alter or add code in my Blogger blog’s templates (XML).
Mostly this kind of error occurs when we add a third-party banner or widget to an XML template. We can easily fix it by making a slight alteration to the code we add!
Just replace “&” with “&amp;” in your HTML/JavaScript code!
EXAMPLE
Original Code:
<!-- Begin Code -->
<script src="http://XXXXXX.com/XXX.php?sid=XXX&br=XXX&dk=XXXXXXXXXXXX" type="text/javascript"/>
<!-- End Code -->
Altered Code:
<!-- Begin Code -->
<script src="http://XXXXXX.com/XXX.php?sid=XXX&amp;br=XXX&amp;dk=XXXXXXXXXXXX" type="text/javascript"/>
<!-- End Code -->
As a workaround, you can replace & with &amp; in the original input; other reserved characters need the same treatment (&lt; instead of <).
Depending on the parser you're using, you can also try to find the class responsible for parsing and unescaping &...; strings, and see if you can extend it or supply your own resolver. (What I'm saying is very vague, but the specifics depend on the tools you're using.)
Building on the answer above from PSpeed, the following replaceAll regex and replacement text will replace all unescaped ampersands with escaped ampersands.
String clean = xml.replaceAll("(&(?!amp;))", "&amp;");
The pattern uses a negative lookahead to match any ampersand that is not already the start of &amp;, and the replacement string is simply an escaped ampersand. Note that other references such as &lt; are not recognized, so their ampersands get escaped too. This can be optimized further for performance by using a statically compiled Pattern.
private static final Pattern unescapedAmpersands = Pattern.compile("(&(?!amp;))");
...
Matcher m = unescapedAmpersands.matcher(xml);
String xmlWithAmpersandsEscaped = m.replaceAll("&amp;");
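If the input may already contain other references (&lt;, &#38;, &#x26; and so on), a slightly wider lookahead avoids double-escaping them. The following is only a sketch of that idea, not part of the answers above: the class name `AmpersandEscaper`, the method `escape` and the exact pattern are assumptions. The pattern treats anything shaped like `&name;`, `&#digits;` or `&#xhex;` as an existing reference and leaves it alone.

```java
import java.util.regex.Pattern;

public class AmpersandEscaper {

    // Matches '&' unless it starts something shaped like a reference:
    // a named one (amp;), a decimal one (#38;) or a hexadecimal one (#x26;).
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escapes only the ampersands that are not
    // already part of a reference.
    static String escape(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // "&b" and the '&' in "AT&T" are escaped; "&lt;", "&amp;" and "&#38;" are kept.
        System.out.println(escape("a=1&b=2 &lt; &amp; &#38; AT&T"));
    }
}
```

This is still a heuristic. Text such as `a&b;c` looks like a reference and is left untouched, and a named reference the document never declares (for example `&nbsp;` in plain XML) will still be rejected by the parser. Fixing the producer of the XML remains the more reliable option.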