I have XML that I need to parse but have no control over the creation of. Unfortunately it\'s not very strict XML and contains things like:
This
Use libraries such as tidy
or tagsoup
.
TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.
If it's not valid XML (like the above) then no XML parser will handle it (as you've identified). If you know the scope of the errors (such as the above entity issue), then the simplest solution may be to run a correcting process over it (fixing entities such as inserting entities) and then feed it to an existing parser.
Otherwise you'll have to code one yourself with built-in support for such anomalies. And I can't believe that's anything other than a tedious and error-prone task.
I believe JSoup can handle badly formed XML