Problem
The XMLStreamReader -> Location has a method called getCharacterOffset().
Unfortunately, the Javadoc indicates that this method is ambiguously named: it can also return a byte offset, and this appears to be true in practice. Unhelpfully, this seems to be what happens when reading from a file, for instance.
The Javadoc states:
Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. (emphasis added)
I really need the character offset, and I'm pretty sure I'm being given the byte offset instead.
The (UTF-8 encoded) XML is contained in a partially corrupt 1 GB file. [Hence the need to use a lower-level API that doesn't complain about the lack of well-formedness until it really has no choice.]
Question
What does the Javadoc mean when it says '...input source is a character media...'? How can I force it to treat my input file as 'character media', so that I get an accurate character offset rather than a byte offset?
Extra blah blah:
[I'm pretty sure this is what is going on: when I split the file apart (using certain known high-level tags) I get a few characters missing or extra, in a non-accumulating way. I'm putting the difference down to a few multi-byte characters throwing off the counter. Also, when I copy parts of the file (using 'head'/'tail' in PowerShell, for instance), the tool appears to correctly recognize (or assume) UTF-8 and does a good conversion to UTF-16, as far as I can see.]
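To make the suspected mismatch concrete, here is a minimal sketch of how byte offsets and character offsets drift apart in UTF-8 text. The class and method names (Utf8Offsets, byteOffsetOf, charOffsetOf) are hypothetical and only for illustration; "character" here means a Java char (a UTF-16 code unit), and both helpers assume the offset falls on a character boundary.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any StAX API.
final class Utf8Offsets {
    // Number of bytes that the first charOffset chars of text occupy in UTF-8.
    // Assumes charOffset does not split a surrogate pair.
    static int byteOffsetOf(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Java chars obtained by decoding the first byteOffset bytes.
    // Assumes byteOffset does not fall in the middle of a multi-byte sequence.
    static int charOffsetOf(byte[] utf8, int byteOffset) {
        return new String(utf8, 0, byteOffset, StandardCharsets.UTF_8).length();
    }
}
```

For the string "aé€b", the 'b' sits at character offset 3 but at byte offset 6, because 'é' takes two bytes and '€' takes three. Decoding a prefix like this is only practical for small inputs; for a 1 GB file it merely illustrates where the discrepancy comes from.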
Answer 1:
The offset is in units of the underlying Source.
The XMLStreamReader only knows how many units it has read from the Source, so the offset is calculated in those units.
A Stream works in units of byte, and therefore you end up with a byte offset.
A Reader works in units of char, and therefore you end up with an offset in char.
The docs for StreamSource are more explicit about what the term "character media" means.
Maybe try something like
final Source source = new StreamSource(new InputStreamReader(new FileInputStream(new File("my.xml")), "UTF-8"));
final XMLStreamReader xmlReader = XMLInputFactory.newFactory().createXMLStreamReader(source);
Answer 2:
XMLInputFactory.createXMLStreamReader(java.io.InputStream) is a byte stream
XMLInputFactory.createXMLStreamReader(java.io.Reader) is a character stream
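One way to check which unit a given parser reports is to feed the same document through both overloads and compare the offsets. The sketch below does that for an in-memory document; the class and method names (OffsetProbe, startElementOffsets, viaReader, viaInputStream) are hypothetical, and the actual numbers depend on the StAX implementation in use, so none are shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical probe class, for illustration only.
final class OffsetProbe {
    // Records Location.getCharacterOffset() at every START_ELEMENT event.
    static List<Integer> startElementOffsets(XMLStreamReader reader) throws XMLStreamException {
        List<Integer> offsets = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
        } finally {
            reader.close();
        }
        return offsets;
    }

    // Character input: the parser is handed a java.io.Reader.
    static List<Integer> viaReader(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        return startElementOffsets(factory.createXMLStreamReader(new StringReader(xml)));
    }

    // Byte input: the parser is handed a java.io.InputStream of UTF-8 bytes.
    static List<Integer> viaInputStream(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        return startElementOffsets(factory.createXMLStreamReader(new ByteArrayInputStream(utf8)));
    }
}
```

If the two lists differ for elements that follow multi-byte characters, the InputStream variant is counting bytes. For a real file, the Reader variant would wrap the file in an InputStreamReader with an explicit UTF-8 charset, as in Answer 1.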
Source: https://stackoverflow.com/questions/15974196/xmlstreamreader-get-character-offset-xml-from-file