How do I keep track of parsing progress of large files in StAX?

Question

I'm processing large (1TB) XML files using the StAX API. Let's assume we have a loop handling some elements:

XMLInputFactory fac = XMLInputFactory.newInstance();
XMLStreamReader reader = fac.createXMLStreamReader(new FileReader(inputFile));
// hasNext() ends the loop at the end of the document instead of looping forever
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        // handle contents
    }
}

How do I keep track of overall progress within the large XML file? Fetching the offset from reader works fine for smaller files:

int offset = reader.getLocation().getCharacterOffset();

but since the offset is an int, it will probably only work for files up to 2GB.

Answer 1:

A simple FilterReader should work.

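import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class ProgressCounter extends FilterReader {
    // Characters consumed so far; a long, so it does not overflow past 2GB.
    private long progress = 0;

    @Override
    public long skip(long n) throws IOException {
        // Count what was actually skipped, which may be fewer than n characters.
        long skipped = super.skip(n);
        progress += skipped;
        return skipped;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int red = super.read(cbuf, off, len);
        // read returns -1 at end of stream; that must not be added to the count.
        if (red > 0) {
            progress += red;
        }
        return red;
    }

    @Override
    public int read() throws IOException {
        // The return value is the character itself (or -1 at end of stream),
        // so count one character rather than adding its value.
        int c = super.read();
        if (c != -1) {
            progress++;
        }
        return c;
    }

    public ProgressCounter(Reader in) {
        super(in);
    }

    public long getProgress() {
        return progress;
    }
}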

Answer 2:

It seems that the StAX API can't give you a long offset.

As a workaround you could create a custom java.io.FilterReader class which overrides read() and read(char[] cbuf, int off, int len) to increment a long offset.

You would pass this reader to the XMLInputFactory. The handler loop can then get the offset information directly from the reader.

You could also do this at the byte level using a FilterInputStream, counting the byte offset instead of the character offset. That would allow for an exact progress calculation given the file size.
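The byte-level variant might look roughly like the sketch below. CountingInputStream and ByteProgressDemo are hypothetical names chosen for illustration, not part of StAX or the JDK, and the small in-memory document only stands in for the real file. Note that the parser reads ahead in buffered chunks, so the count runs slightly ahead of the element currently being handled; for a file of this size that should not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts the bytes actually consumed from the wrapped stream.
class CountingInputStream extends FilterInputStream {
    private long count = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++; // one byte consumed; -1 signals end of stream
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped; // only what was really skipped
        return skipped;
    }

    long getCount() {
        return count;
    }
}

class ByteProgressDemo {
    // Parses the stream, reporting progress as bytes read relative to totalBytes.
    // Returns the number of start elements seen.
    static long countStartElements(CountingInputStream in, long totalBytes)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        long elements = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    elements++;
                    // Real code would report far less often than once per element.
                    double percent = 100.0 * in.getCount() / totalBytes;
                    System.out.printf("element %d, about %.1f%% read%n", elements, percent);
                }
            }
        } finally {
            reader.close();
        }
        return elements;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real file; with a file, totalBytes would be its length on disk.
        byte[] xml = "<root><a/><b/></root>".getBytes(StandardCharsets.UTF_8);
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(xml));
        long elements = countStartElements(in, xml.length);
        System.out.println(elements + " start elements, " + in.getCount() + " bytes read");
    }
}
```

For a real file, the stream would be something like a FileInputStream wrapped in CountingInputStream, with the file's length in bytes passed as totalBytes. Handing the factory an InputStream instead of a FileReader also lets the parser detect the encoding from the XML declaration.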

Source: https://stackoverflow.com/questions/34724494/how-do-i-keep-track-of-parsing-progress-of-large-files-in-stax

Tags

java

xml

stax