Question
I'm processing large (1TB) XML files using the StAX API. Let's assume we have a loop handling some elements:
XMLInputFactory fac = XMLInputFactory.newInstance();
XMLStreamReader reader = fac.createXMLStreamReader(new FileReader(inputFile));
// Loop until the end of the document rather than forever
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        // handle contents
    }
}
How do I keep track of overall progress within the large XML file? Fetching the offset from reader works fine for smaller files:
int offset = reader.getLocation().getCharacterOffset();
but since the offset is an int, it will probably only work for files up to 2GB.
Answer 1:
A simple FilterReader should work.
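import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class ProgressCounter extends FilterReader {
    private long progress = 0;

    public ProgressCounter(Reader in) {
        super(in);
    }

    @Override
    public long skip(long n) throws IOException {
        // Count the characters actually skipped, which may be fewer than n
        long skipped = super.skip(n);
        progress += skipped;
        return skipped;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int red = super.read(cbuf, off, len);
        // read returns -1 at end of stream; do not subtract it from the count
        if (red > 0) {
            progress += red;
        }
        return red;
    }

    @Override
    public int read() throws IOException {
        int red = super.read();
        // red is the character value (or -1), so count one character, not its value
        if (red != -1) {
            progress++;
        }
        return red;
    }

    public long getProgress() {
        return progress;
    }
}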
Answer 2:
It seems that the StAX API can't give you a long offset.
As a workaround, you could create a custom java.io.FilterReader class that overrides read() and read(char[] cbuf, int off, int len) to increment a long offset.
You would pass this reader to the XMLInputFactory. The handler loop can then get the offset information directly from the reader.
You could also do this at the byte level using a FilterInputStream, counting the byte offset instead of the character offset. That would allow for an exact progress calculation given the file size.
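A minimal sketch of the byte-level variant, assuming the JDK's built-in StAX implementation. The names ByteProgressDemo, CountingInputStream, and countElements are hypothetical and do not come from the answers above. Since the parser pulls its input in buffered blocks, the byte count runs somewhat ahead of the element currently being handled, so the percentage is best treated as an estimate.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ByteProgressDemo {

    /** Hypothetical helper: counts the bytes that pass through the stream. */
    static class CountingInputStream extends FilterInputStream {
        private long bytesRead = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                bytesRead++; // b is the byte value, so count exactly one byte
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                bytesRead += n; // n is -1 at end of stream, so add only positive counts
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(n);
            if (skipped > 0) {
                bytesRead += skipped; // may be fewer than the n requested
            }
            return skipped;
        }

        long getBytesRead() {
            return bytesRead;
        }
    }

    /** Parses the stream, reports progress per start element, and returns how many were seen. */
    static int countElements(InputStream raw, long totalBytes) throws XMLStreamException {
        CountingInputStream counted = new CountingInputStream(raw);
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(counted);
        int elements = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                elements++;
                // Bytes consumed so far relative to the known input size
                double percent = 100.0 * counted.getBytesRead() / totalBytes;
                System.out.printf("%s: about %.1f%% read%n", reader.getLocalName(), percent);
            }
        }
        reader.close();
        return elements;
    }

    public static void main(String[] args) throws XMLStreamException {
        // A tiny in-memory document stands in for the large file
        byte[] xml = "<root><a/><b/></root>".getBytes(StandardCharsets.UTF_8);
        int elements = countElements(new ByteArrayInputStream(xml), xml.length);
        System.out.println("start elements: " + elements);
    }
}
```

For a real file, you would pass a FileInputStream together with the size reported by File.length() in place of the in-memory array. Both values are long, so the calculation is not limited to 2GB.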
Source: https://stackoverflow.com/questions/34724494/how-do-i-keep-track-of-parsing-progress-of-large-files-in-stax