I am writing a Java program to read an XML file, actually an iTunes library, which is in XML plist format. I have managed to get round most obstacles that this format throws up exc
Do you have an excerpt for us? Is the file iTunes-generated? If so, it sounds like a bug in iTunes to me: it forgot to encode the ampersand correctly. I would not be surprised: they clearly didn't get XML in the first place, their schema of
must make the XML inventors puke.
You might want to use a different, more robust parser. SAX is great as long as the file is well-formed. I do not know, however, how robust dom4j and JDOM are; just give them a try. For Python, I would recommend ElementTree or BeautifulSoup, which are very robust.
Also have a look at http://code.google.com/p/xmlwise/ which I found mentioned here on Stack Overflow (did you use search?).
Update (as per the updated question): You need to understand the role of entities in XML, and thus in SAX. By default they are separate nodes, just like text nodes, so you will likely need to join them with the adjacent text nodes to get the full value. Do you use a DTD in your parser? Using a proper DTD, with entity definitions, can help parsing a lot, as it can contain mappings from entities such as &amp;amp; to the characters they represent (&amp;), and the parser may be able to do the merging for you. (At least the Python XML pull parser I like to use for large files does when materializing subtrees.)
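To make the "join the pieces" idea concrete in Java: a SAX parser is allowed to deliver the text of one element in several characters() calls, and an entity reference is a typical place where it splits. Below is a minimal sketch of a handler that buffers the chunks and only reads the value in endElement(). The class name PlistStrings, the method readStrings and the sample XML are made up for illustration; a real iTunes file also has a DOCTYPE line and nested dicts, which this sketch ignores.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class PlistStrings {

    /** Returns the text content of every <string> element in the given XML. */
    static List<String> readStrings(String xml) throws Exception {
        List<String> values = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            // Buffer for the text of the element currently being read.
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                text.setLength(0); // new element: start with an empty buffer
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called more than once per element, so append
                // instead of treating each call as the complete value.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("string".equals(qName)) {
                    values.add(text.toString()); // full value is known only here
                }
                text.setLength(0);
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Simplified sample, assumed for illustration only.
        String xml = "<dict><key>Name</key><string>Tom &amp; Jerry</string></dict>";
        System.out.println(readStrings(xml)); // prints [Tom & Jerry]
    }
}
```

If your current handler stores the value directly inside characters(), you will typically end up with only the last chunk (the part after the ampersand), which would match a value that appears cut off at the &.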