I am working on an XML data set (the DrugBank database available here) where some fields contain escaped XML characters like "&amp;", etc.
To make the problem more concrete, here is an example scenario:
<drugs>
  <drug>
    <drugbank-id>DB00001</drugbank-id>
    <general-references>
      # Askari AT, Lincoff AM: Antithrombotic Drug Therapy in Cardiovascular Disease. 2009 Oct; pp. 440–. ISBN 9781603272346. "Google books":http://books.google.com/books?id=iadLoXoQkWEC&amp;pg=PA440.
    </general-references>
    .
  </drug>
  <drug>
    ...
  </drug>
  ...
</drugs>
Since the entire document is huge, I am parsing it as follows:
VTDGen gen = new VTDGen();
try {
    gen.setDoc(Files.readAllBytes(DRUGBANK_XML));
    gen.parse(true);
} catch (IOException | ParseException e) {
    SystemHelper.exitWithMessage(e, "Unable to process Drugbank XML data. Aborting.");
}
VTDNav nav = gen.getNav();
AutoPilot pilot = new AutoPilot(nav);
pilot.selectXPath("//drugs/drug");
while (pilot.evalXPath() != -1) {
    long fragment = nav.getContentFragment();
    String drugXML = nav.toString((int) fragment, (int) (fragment >> 32));
    System.out.println(drugXML);
    finerParse(drugXML); // another method handling a more detailed data analysis
}
When I tested the finerParse method with sample XML (snippets copy-pasted from the same data), it worked fine. But when called from the above code, it failed with the error message "Errors in Entity: Illegal entity char". Upon printing the input to finerParse (i.e., the drugXML string), I noticed that the string &amp;pg=PA440 in the original XML had been changed to &pg=PA440.
Why is this happening? All I am doing is parsing it with a very well-known parser.
P.S. I have found an alternate solution where I simply pass the VTDNav as the argument to finerParse instead of first obtaining the content string and passing that string. But I am still curious about what is going wrong with the above approach.
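Use vtdNav.toRawString() instead of vtdNav.toString() and the problem should go away. toString() resolves entity references, so &amp; comes back as a bare &, which is not legal in XML text; toRawString() returns the characters exactly as they appear in the document. Let me know whether it works.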
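To see why a string with resolved entities cannot be parsed a second time, here is a minimal sketch. It uses the JDK's built-in DOM parser rather than VTD-XML, so the exact error text will differ from the one in the question. The class name EntityDemo and the helper isWellFormed are made up for illustration, and the replace() call only simulates what entity resolution does to the fragment.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EntityDemo {

    // Hypothetical helper: true if the text parses as a well-formed XML document.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The fragment as it appears in the file: the ampersand is escaped.
        String raw = "<ref>http://books.google.com/books?id=iadLoXoQkWEC&amp;pg=PA440</ref>";
        // Simulates the fragment after entity resolution: a bare ampersand.
        String resolved = raw.replace("&amp;", "&");

        System.out.println(isWellFormed(raw));      // true
        System.out.println(isWellFormed(resolved)); // false: "&pg" is read as an unterminated entity reference
    }
}
```

The second string fails because a parser treats every bare & as the start of an entity reference. Any fragment that is going to be parsed again needs to keep its escapes, which is what the raw variant preserves.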
Source: https://stackoverflow.com/questions/27823107/vtd-xml-seems-to-be-spoiling-escaped-string-in-xml-document