org.xml.sax.SAXParseException: The reference to entity “T” must end with the ';' delimiter

北海茫月 2020-12-29 07:35

I am trying to parse an XML file which contains some special characters like "&" using a DOM parser. I am getting the SAXParseException "the reference to entity must e

9 Answers
  • 2020-12-29 07:42

    To complement @PSpeed's answer, here is a complete solution (SAX parser):

        try {
    
            InputStream xmlStreamToParse = blob.getBinaryStream();
    
            // Clean
            BufferedReader br = new BufferedReader(new InputStreamReader(xmlStreamToParse));
    
            StringBuilder sb = new StringBuilder();
    
            String line;
            while ((line = br.readLine()) != null) {
                // Escape bare ampersands; readLine() strips the line break, so add it back
                sb.append(line.replaceAll("&([^;]+(?!(?:\\w|;)))", "&amp;$1")).append('\n'); // or whatever you want to clean
            }
    
            InputStream stream = org.apache.commons.io.IOUtils.toInputStream(sb.toString(), "UTF-8");
    
            // Parsing
            SAXParserFactory saxFactory = SAXParserFactory.newInstance();
            saxFactory.setNamespaceAware(true);
            SAXParser theParser = saxFactory.newSAXParser();
            XMLReader xmlReader = theParser.getXMLReader();
            LicenceXMLHandler licence = new LicenceXMLHandler();
            xmlReader.setContentHandler(licence);
            xmlReader.parse(new InputSource(stream));
    
        } catch (SQLException | SAXException | IOException | ParserConfigurationException e) {
            log.error("Error: " + e);
        }
    

    Explanations:

    • Transform the Blob into an InputStream
    • Clean the Blob
    • Parse the file (LicenceXMLHandler is the parser class)
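
    Since the question mentions a DOM parser, the same clean-then-parse idea can be sketched with DocumentBuilder and no commons-io dependency. This is only a sketch under assumptions: the class and method names are hypothetical, the whole document is assumed to fit in a String, and anything shaped like "&name;" or "&#123;" is treated as an existing reference and left alone (so an undeclared entity such as &nbsp; would still be rejected by the parser).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the answer above.
class LenientDomParser {

    // An '&' that does not start something shaped like "&name;", "&#123;" or "&#x1F;".
    private static final Pattern BARE_AMPERSAND = Pattern.compile("&(?!#?\\w+;)");

    // Escapes each bare '&' individually, so several on one line are all handled.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    // Cleans the input, then hands it to a namespace-aware DOM parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            String clean = escapeBareAmpersands(xml);
            return builder.parse(new InputSource(new StringReader(clean)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<shop name=\"AT&T\">Fish &amp; Chips & more</shop>");
        System.out.println(doc.getDocumentElement().getAttribute("name"));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

    Parsing from a StringReader also avoids re-encoding the cleaned text to bytes, so the character set only has to be right once, when the Blob is first read.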
  • 2020-12-29 07:46

    Simply replace your & with &amp; and it will work.

  • 2020-12-29 07:47

    Your input is invalid XML. Specifically, you cannot have an '&' character in an attribute value unless it is part of a well-formed character entity reference.

    AFAIK, you have two choices:

    • Write a "not exactly XML" parser yourself. I seriously doubt that you will find an existing one. Any self-respecting XML parser will reject invalid input.
    • Fix whatever is creating this (so-called) XML so that it doesn't put random '&' characters in places where they are not allowed. It's quite simple really: as you are building the XML, replace every '&' character that is not already part of a character or entity reference with '&amp;'.
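
    If the producing side is Java, one way to follow the second option is to let an XML writer do the escaping instead of concatenating strings. The sketch below uses the standard StAX XMLStreamWriter, whose writeAttribute and writeCharacters methods escape '&' for you. The class name, the element name "link" and the sample values are assumptions made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical producer: builds one element without hand-written escaping.
class SafeXmlWriter {

    static String linkElement(String href, String label) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("link");
            writer.writeAttribute("href", href); // '&' in the value is written as &amp;
            writer.writeCharacters(label);       // '&' in the text is written as &amp;
            writer.writeEndElement();
            writer.flush();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkElement("http://example.com/x.php?sid=1&br=2", "AT&T"));
    }
}
```

    The raw values are passed in unescaped; escaping them beforehand would lead to double escaping.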
  • 2020-12-29 07:52

    Some of you might be familiar with the error “The reference to entity XX must end with the ‘;’ delimiter”, which can appear after adding or altering code in an XML template. I get it myself sometimes when I edit my Blogger blog’s XML templates.

    These errors mostly occur when we add a third-party banner or widget to an XML template. We can easily fix them with a slight alteration to the code we add:

    Just replace “&” with “&amp;” in your HTML/JavaScript code!
    

    EXAMPLE

    Original Code:
    <!-- Begin Code -->
    <script src="http://XXXXXX.com/XXX.php?sid=XXX&br=XXX&dk=XXXXXXXXXXXX" type="text/javascript"/>
    <!-- End Code -->
    
    Altered Code:
    
    <!-- Begin Code -->
    <script src="http://XXXXXX.com/XXX.php?sid=XXX&amp;br=XXX&amp;dk=XXXXXXXXXXXX" type="text/javascript"/>
    <!-- End Code -->
    
  • 2020-12-29 07:52

    As a workaround, you can:

    1. Replace all the occurrences of & with &amp; in the original input;
    2. Parse it;
    3. In your code that handles the result, handle the case where you now get escaped characters (e.g. &lt; instead of <).

    Depending on the parser you're using, you can also try to find the class responsible for parsing and unescaping &-strings, and see if you can extend it/supply your own resolver. (What I'm saying is very vague, but the specifics depend on the tools you're using.)
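
    The three steps of the workaround might look roughly like this with a DOM parser. It is a sketch under assumptions: the class and method names are hypothetical, only the five predefined XML entities are unescaped afterwards (numeric references such as &#38; are left as literal text), and only the root element's text is read to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration of the escape-everything workaround.
class EscapeEverythingWorkaround {

    // Step 1: escape every '&', including those that already start a reference.
    static String escapeAll(String xml) {
        return xml.replace("&", "&amp;");
    }

    // Step 3: undo the predefined entities that step 1 turned into literal text.
    // "&amp;" is handled last so that a literal "&amp;lt;" becomes "&lt;" and not "<".
    static String unescape(String value) {
        return value.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    // Steps 1 to 3 applied to the text content of the root element.
    static String rootText(String brokenXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(escapeAll(brokenXml))));
            return unescape(doc.getDocumentElement().getTextContent());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<note>1 &lt; 2 & AT&T</note>"));
    }
}
```

    The same unescape call would be needed for every attribute value and text node the handling code reads, which is the main cost of this approach.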

  • 2020-12-29 07:57

    Building on the answer above from PSpeed, the following replaceAll regex and replacement text will replace all unescaped ampersands with escaped ampersands.

        String clean = xml.replaceAll("(&(?!amp;))", "&amp;");
    

    The pattern uses a negative lookahead to match any ampersand that is not already followed by amp;, and the replacement string is simply an escaped ampersand. If this runs often, it can be made faster by compiling the Pattern once as a static constant.

    private final static Pattern unescapedAmpersands = Pattern.compile("(&(?!amp;))");
    
    ...
    
    Matcher m = unescapedAmpersands.matcher(xml);
    String xmlWithAmpersandsEscaped = m.replaceAll("&amp;");
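
    One caveat with the lookahead above: it only recognizes &amp;, so an existing &lt; or &#38; in the input would be rewritten to &amp;lt; or &amp;#38;. A stricter variant might skip the five predefined XML entities and numeric character references as well. This is a sketch, not a drop-in fix: the class name is hypothetical, and it assumes the document declares no entities of its own in a DTD (any other name, such as &nbsp;, is treated as plain text and escaped).

```java
import java.util.regex.Pattern;

// Hypothetical stricter variant of the pattern shown above.
class StrictAmpersandEscaper {

    // Matches '&' unless it starts one of the five predefined entities
    // or a decimal / hexadecimal character reference.
    private static final Pattern UNESCAPED_AMPERSAND = Pattern.compile(
            "&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escape(String xml) {
        return UNESCAPED_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("a&b &amp; &lt; &#38; &#x26; &nbsp;"));
    }
}
```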
    