I am writing a Java program to read an XML file, actually an iTunes library, which is in XML plist format. I have managed to get round most obstacles that this format throws up exc
Do you have an excerpt for us? Is the file iTunes-generated? If so, it sounds like a bug in iTunes to me: it forgot to encode the ampersand correctly. I would not be surprised: they clearly didn't get XML in the first place; their schema of <name>[key]</name><string>[value]</string> must make the XML inventors puke.
You might want to use a different, more robust parser. SAX is great as long as the file is well-formed. I do not know, however, how robust dom4j and JDOM are. Just give them a try. For Python, I know that I would recommend ElementTree or BeautifulSoup, which are very robust.
Also have a look at http://code.google.com/p/xmlwise/ which I found mentioned here on Stack Overflow (did you use search?).
Update (as per the updated question): You need to understand the role of entities in XML and thus in SAX. By default they are separate nodes, just like text nodes, so you will likely need to join them with the adjacent text nodes to get the full value. Do you use a DTD in your parser? Using a proper DTD, with entity definitions, can help parsing a lot, as it can contain mappings from entities such as &amp; to the characters they represent (&), and the parser may be able to do the merging for you. (At least the Python XML pull parser I like to use for large files does when materializing subtrees.)
There is something fishy about what you are trying to do.
If the file format you are trying to parse contains bare ampersand (&) characters then it is not well-formed XML. Ampersands are represented as character entities (e.g. &amp;) in well-formed XML.
If it is really supposed to be real XML, then there is a bug in whatever wrote / generated the file.
If it is not supposed to be real XML (i.e. those ampersands are not a mistake), then you probably shouldn't be trying to parse it using an XML parser.
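To make the well-formedness point concrete, here is a minimal sketch that feeds a bare ampersand and an escaped one to the JDK's SAX parser. The class name WellFormedCheck, the helper isWellFormed and the sample strings are made up for illustration; they are not from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    // Hypothetical helper: true if the parser accepts the document,
    // false if it reports a well-formedness error.
    public static boolean isWellFormed(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler.fatalError rethrows the parser's fatal error
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Bare ampersand: not well-formed
        System.out.println(isWellFormed("<string>Simon & Garfunkel</string>"));
        // Escaped ampersand: well-formed
        System.out.println(isWellFormed("<string>Simon &amp; Garfunkel</string>"));
    }
}
```

If the first call reports a failure on your real file, the file is broken; if the parse succeeds, the ampersands are encoded correctly and the problem lies elsewhere.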
Ah, I see. The XML is actually correctly encoded, but you didn't get the SO markup right.
It would appear that your real problem is that your characters(...) callback is being called separately for the text before the &amp;, for the (decoded) &, and finally for the text after the &amp;. You simply have to deal with this by joining the text chunks back together.
The javadoc for ContentHandler.characters() says this:
"The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks ...".
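One common way to join the chunks is to append them to a buffer in characters(...) and only read the buffer in endElement(...). The following is a minimal sketch, not your actual handler: the class name TextCollector, the parse helper and the sample document are assumptions for illustration, and it only collects the text of <string> elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start with an empty buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so append only
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("string".equals(qName)) {
            values.add(text.toString()); // the complete, joined text
        }
        text.setLength(0);
    }

    public List<String> getValues() {
        return values;
    }

    // Hypothetical convenience method for trying the handler on a string
    public static List<String> parse(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getValues();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                parse("<dict><key>Name</key><string>Simon &amp; Garfunkel</string></dict>"));
    }
}
```

With this approach it no longer matters how many chunks the parser chooses to deliver; the value is only used once the element has ended.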
It's probably not the best general solution for escape characters, but I only had to take newlines into account, so it was easy to just check for \n.
You could check for the backslash \ to catch all escape characters, or in your case for &, although I think others will come up with more elegant solutions.
@Override
public void characters(char[] ch, int start, int length)
{
    String elementData = new String(ch, start, length);
    boolean elementDataContainsNewLine = (elementData.indexOf("\n") != -1);
    if (!elementDataContainsNewLine)
    {
        // do what you want if it is not a newline
    }
}
I am parsing the below string using SAXParser
<xml>
<FirstTag>&amp;&lt;</FirstTag>
<SecondTag>test</SecondTag>
</xml>
I want the same string to be retained, but it is getting converted to the following:
<xml>
<FirstTag>&<</FirstTag>
<SecondTag>test</SecondTag>
</xml>
Here is my code. How can I avoid this being converted?
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
MyHandler handler = new MyHandler();
values = handler.getValues();
saxParser.parse(x, handler);