Question
Is there a way to accurately gather the byte offsets of xml tags using the XMLStreamReader?
I have a large xml file that I require random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then be able to use a RandomAccessFile to retrieve the tag content later.
XMLStreamReader doesn't seem to have a way to track character offsets. Instead, people recommend attaching the XMLStreamReader to a stream that tracks how many bytes have been read (the CountingInputStream provided by Apache Commons IO, for example).
e.g.:
    CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile));
    XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");
    while (xmlStreamReader.hasNext()) {
        int eventCode = xmlStreamReader.next();
        switch (eventCode) {
            case XMLStreamReader.END_ELEMENT:
                System.out.println(xmlStreamReader.getLocalName() + " @" + countingReader.getByteCount());
        }
    }
    xmlStreamReader.close();
Unfortunately there must be some buffering going on, because the above code prints out the same byte offsets for several tags. Is there a more accurate way of tracking byte offsets in xml files (ideally without resorting to abandoning proper xml parsing)?
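For context, the retrieval half of the plan described above (seek to a stored offset with a RandomAccessFile and read the tag content) might look roughly like the sketch below. The class name ByteRangeReader and the method readRange are made up for illustration, and the sketch assumes the file is UTF-8 and that the start and end byte offsets are already known and accurate.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class ByteRangeReader {
    /** Reads bytes [start, end) from the file and decodes them as UTF-8 (assumed encoding). */
    static String readRange(String path, long start, long end) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            byte[] buf = new byte[(int) (end - start)];
            file.seek(start);      // jump straight to the stored byte offset
            file.readFully(buf);   // read exactly end - start bytes
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small throwaway file so the sketch can be run on its own.
        Path tmp = Files.createTempFile("offsets", ".xml");
        Files.write(tmp, "<a><b>x</b></a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readRange(tmp.toString(), 3, 11));
        Files.delete(tmp);
    }
}
```

This only works if the offsets are byte offsets. A character offset would first have to be converted whenever the file contains multi-byte characters.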
Answer 1:
You could use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but I remember reading somewhere that it is not reliable and precise. And it looks like it gives the endpoint of the tag, not the starting location.
I have a similar need to precisely know the location of tags within a file, and I'm looking at other parsers to see if there is one that guarantees to give the necessary level of location precision.
Answer 2:
You could use a wrapper input stream around the actual input stream. It would simply defer to the wrapped stream for the actual I/O operations while keeping an internal counter, plus a method to retrieve the current offset.
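A minimal sketch of such a wrapper, built on java.io.FilterInputStream, is shown below. The class name OffsetTrackingInputStream and its getOffset() method are invented for this example. Note that it counts the bytes the consumer has pulled from the stream, so if the parser reads ahead in blocks (as the question observes), the count will run ahead of the parser's logical position.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: delegates all I/O and counts the bytes handed to the caller.
final class OffsetTrackingInputStream extends FilterInputStream {
    private long offset; // number of bytes consumed so far

    OffsetTrackingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) offset++;          // -1 means end of stream, nothing consumed
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) offset += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        offset += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false;                  // mark/reset would make the count inconsistent
    }

    long getOffset() {
        return offset;
    }
}

final class OffsetTrackingDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = "<a><b>x</b></a>".getBytes(StandardCharsets.UTF_8);
        OffsetTrackingInputStream in =
                new OffsetTrackingInputStream(new ByteArrayInputStream(data));
        in.read(new byte[6]);          // consume the bytes of "<a><b>"
        System.out.println("offset after first read: " + in.getOffset());
    }
}
```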
Answer 3:
Unfortunately, Aalto doesn't implement the LocationInfo interface.
The latest Java release of VTD-XML (XimpleWare), currently 2.11 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), includes code that maintains a byte offset after each call to the getChar() method of its IReader implementations.
The IReader implementations for the various character encodings are inside VTDGen.java and VTDGenHuge.java.
IReader implementations are provided for the following encodings:
ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258
To track character offsets as well, you could add a getCharOffset() method to IReader. Implement it by adding a charCount member alongside the existing offset member of the VTDGen and VTDGenHuge classes, and increment it on each getChar() and skipChar() call in every IReader implementation. That should give you the start of a solution.
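To illustrate the idea only, here is a standalone sketch of a UTF-8 reader that keeps a byte offset and a character count side by side. It is not VTD-XML code: the class Utf8OffsetReader is hypothetical, it only borrows the names getChar() and getCharOffset() from the description above, and it assumes well-formed UTF-8 input with no error handling.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a UTF-8 reader; NOT VTD-XML's actual IReader.
final class Utf8OffsetReader {
    private final byte[] doc;
    private int offset;     // byte offset of the next unread byte
    private int charCount;  // code points decoded so far (one per getChar() call)

    Utf8OffsetReader(byte[] doc) {
        this.doc = doc;
    }

    boolean hasNext() {
        return offset < doc.length;
    }

    /** Decodes one code point, assuming well-formed UTF-8. */
    int getChar() {
        int b0 = doc[offset] & 0xFF;
        int len;
        int cp;
        if (b0 < 0x80)      { len = 1; cp = b0; }          // 1-byte sequence (ASCII)
        else if (b0 < 0xE0) { len = 2; cp = b0 & 0x1F; }   // 2-byte sequence
        else if (b0 < 0xF0) { len = 3; cp = b0 & 0x0F; }   // 3-byte sequence
        else                { len = 4; cp = b0 & 0x07; }   // 4-byte sequence
        for (int i = 1; i < len; i++) {
            cp = (cp << 6) | (doc[offset + i] & 0x3F);     // fold in continuation bytes
        }
        offset += len;   // the byte offset advances by the encoded length
        charCount++;     // the character offset advances by one
        return cp;
    }

    int getByteOffset() { return offset; }
    int getCharOffset() { return charCount; }
}

final class Utf8OffsetDemo {
    public static void main(String[] args) {
        Utf8OffsetReader r =
                new Utf8OffsetReader("<\u00e9>".getBytes(StandardCharsets.UTF_8));
        while (r.hasNext()) {
            int cp = r.getChar();
            System.out.println("U+" + Integer.toHexString(cp)
                    + " bytes=" + r.getByteOffset() + " chars=" + r.getCharOffset());
        }
    }
}
```

The two counters agree for pure ASCII input and drift apart as soon as a multi-byte character appears, which is why both are worth keeping.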
Answer 4:
I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.
    switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            System.out.println(xmlStreamReader.getLocalName() + " end@" + xmlStreamReader.getLocation().getCharacterOffset());
    }
This solution has the advantage of not needing an external JAR file, but the actual start position of each end tag would have to be calculated manually.
I was not able to track down some minor inconsistencies in the data management (I think it has to do with how I initialized my XMLStreamReader), but I always saw a consistent increase in the location as the reader moved through the content.
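The manual calculation mentioned above could be sketched as follows. The helper endTagStart is made up for this example and assumes the end tag is written exactly as "</name>", with no namespace prefix and no whitespace before ">". The offsets are whatever getCharacterOffset() reports, which for a character source such as the StringReader used here should be a character offset rather than a byte offset.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class for illustration.
final class EndTagOffsets {
    /**
     * Start of an end tag, given the offset just past it.
     * Assumes the tag is written exactly as "</name>" (no prefix, no whitespace).
     */
    static int endTagStart(int offsetAfterEndTag, String rawName) {
        return offsetAfterEndTag - ("</" + rawName + ">").length();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<a><b>x</b></a>";
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.END_ELEMENT) {
                int after = reader.getLocation().getCharacterOffset();
                // Prefix-less document, so the local name equals the raw tag name.
                String name = reader.getLocalName();
                System.out.println(name + " start@" + endTagStart(after, name)
                        + " end@" + after);
            }
        }
        reader.close();
    }
}
```

Whether the reported offsets are exact depends on the StAX implementation in use, so it is worth checking a few values against the file before relying on them.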
Hope this helps!
Answer 5:
I recently worked out a solution for a similar question, "How to find character offsets in big XML files using java?". I think it provides a good solution based on an ANTLR-generated XML parser.
Source: https://stackoverflow.com/questions/3176610/java-gathering-byte-offsets-of-xml-tags-using-an-xmlstreamreader