org.xml.sax.SAXParseException: The reference to entity “T” must end with the ';' delimiter

北海茫月 2020-12-29 07:35

I am trying to parse an XML file which contains some special characters like "&" using a DOM parser. I am getting the SAXParseException "the reference to entity must e

9 answers
  • 2020-12-29 07:59

    I'm not sure I understand the question. As far as I'm aware, unless you're inside a CDATA, naked & characters without a closing ; are invalid.

    If that's not the case for your XML file, then it's invalid, and you'll need to find another way of parsing it, or fixing it before SAX gets a hold of it.

    If I'm misunderstanding something here, you should probably post a sample of the actual XML so we can help further.

    Update:

    It looks like:

    Figure ActualText="&T "
    

    is the offending line. Is this section within a CDATA or not? If not, this is not valid XML and you should not expect SAX to be able to handle it.

    You'll need to either:

    • change the application that created it; or
    • fix it before it's loaded by SAX (if you can't change that application) to something like Figure ActualText="&amp;T "; or
    • find a non-SAX method for parsing.
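To see the difference the escaping makes, here is a minimal Java sketch that feeds both forms of the attribute to the JDK's default DOM parser. The class name `BareAmpersandDemo` and the method `actualTextOrNull` are made up for illustration, and the one-element document is an assumption standing in for the real file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of the original question's code.
final class BareAmpersandDemo {

    /**
     * Parses the XML string with a DOM parser and returns the root element's
     * ActualText attribute, or null if the input is not well-formed.
     */
    static String actualTextOrNull(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("ActualText");
        } catch (SAXParseException e) {
            // This is where "The reference to entity ... must end with the ';'
            // delimiter" surfaces for a bare ampersand.
            return null;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bare ampersand: rejected by the parser.
        System.out.println(actualTextOrNull("<Figure ActualText=\"&T \"/>"));
        // Escaped ampersand: parses, and the attribute value contains a literal &.
        System.out.println(actualTextOrNull("<Figure ActualText=\"&amp;T \"/>"));
    }
}
```

The first call returns null because the parser rejects the bare ampersand. The second returns the attribute value with the ampersand restored, so the application still sees the text it expects.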
  • 2020-12-29 07:59

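    If you can preprocess the files, running the command below before publishing escapes the ampersands. Replace *.xml with your XML file name. Note that it rewrites every &, including those in references that are already valid (for example, &amp; becomes &amp;amp;).

    sed -i "s/&/\&amp;/g" *.xml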
    
  • 2020-12-29 08:01

    As others have stated, your XML is definitely invalid. However, if you can't change the generating application and can add a cleaning step then the following should clean up the XML:

    String clean = xml.replaceAll( "&([^;]+(?!(?:\\w|;)))", "&amp;$1" );
    

    What that regex is doing is looking for any badly formed entity references and escaping the ampersand.

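    Specifically, (?!(?:\\w|;)) is a negative look-ahead that only lets the match end where the next character is neither a word character (a-z, A-Z, 0-9, _) nor a semi-colon. [^;]+ is greedy, so the regex grabs as much as it can after the & without crossing a ;, then backs off until that look-ahead holds. A well-formed reference such as &amp; therefore never matches, but a second bare & that appears before the next ; falls inside the same match and is left unescaped.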

    It puts everything except the ampersand in the first capture group so that it can be referred to in the replace string. That's the $1.

    Note that this won't fix references that look like they are valid but aren't. For example, if you had &T; that would throw a different kind of error altogether unless the XML actually defines the entity.
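If several bare ampersands can appear close together, a stricter variant may be easier to reason about: escape every & that does not start something shaped like a named, decimal, or hexadecimal reference. The following is a sketch under assumptions, not a drop-in fix. The class name `AmpersandCleaner` and the method `escapeBareAmpersands` are hypothetical, entity names are approximated with ASCII characters only, and the input is assumed to contain no CDATA sections (an & inside CDATA is literal and should not be escaped).

```java
import java.util.regex.Pattern;

// Hypothetical cleaning step, run on the raw text before it reaches the parser.
final class AmpersandCleaner {

    // Matches an '&' that is NOT followed by "name;", "#digits;" or "#xhex;".
    // The name pattern is an ASCII-only approximation of an XML name.
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z_:][\\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Replaces each bare ampersand with &amp; and leaves reference-shaped text alone. */
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("Figure ActualText=\"&T \""));
        System.out.println(escapeBareAmpersands("a &amp; b &#38; c &#x26; d"));
    }
}
```

Like the regex above, this leaves &T; untouched, because it looks like a valid reference even though the entity may not be declared.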
