XMLStreamReader: get character offset : XML from file

Submitted by 余生长醉 on 2019-12-13 14:25:26

Question


The XMLStreamReader->Location has a method called getCharacterOffset().

Unfortunately, the Javadoc indicates that this method is ambiguously named: it can also return a byte offset (and this appears to be true in practice). Unhelpfully, this seems to happen when reading from files, for instance.

The Javadoc states:

Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. (emphasis added)

I really need the character offset; and I'm pretty sure I'm being given the byte offset instead.

The (UTF-8 encoded) XML is contained in a (partially corrupt 1G) file. [Hence the need to use a lower-level API which doesn't complain about the lack of well-formedness until it really has no choice but to].

Question

What does the Javadoc mean when it says '...input source is a character media...'? How can I force it to treat my input file as 'character media', so that I get an accurate character offset rather than a byte offset?

Extra blah blah:

[I'm pretty sure this is what is going on: when I split the file apart (using certain known high-level tags) I get a few characters missing or extra, in a non-accumulating way. I'm putting the difference down to a few multi-byte characters throwing off the counter. Also, when I copy (using 'head'/'tail' in PowerShell, for instance), the tool appears to correctly recognize (or assume) UTF-8 and does a good conversion to UTF-16, as far as I can see.]


Answer 1:


The offset is in units of the underlying Source.

The XMLStreamReader only knows how many units it has read from the Source so the offset is calculated in those units.

An InputStream works in units of bytes, so you end up with a byte offset.

A Reader works in units of chars, so you end up with an offset in chars.
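To see why the two units drift apart on UTF-8 input, here is a minimal sketch that counts the same text both ways. The class `OffsetUnits` and its methods are hypothetical helpers made up for illustration; they are not part of StAX.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only, not part of any library.
final class OffsetUnits {

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Java chars (UTF-16 code units) in the text.
    static int charLength(String text) {
        return text.length();
    }

    public static void main(String[] args) {
        // \u00e9 (e with acute accent) takes 2 bytes in UTF-8, \u20ac (euro sign) takes 3.
        String prefix = "<a>\u00e9\u20ac</a>";
        System.out.println("bytes: " + utf8ByteLength(prefix));
        System.out.println("chars: " + charLength(prefix));
    }
}
```

Every non-ASCII character before a given position widens the gap between the byte count and the char count, which would be consistent with offsets that are off by only a few units.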

The docs for StreamSource are more explicit about what the term "character media" means.

Maybe try something like:

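// Wrapping the file in a Reader hands the parser chars instead of bytes, so offsets should be counted in chars.
// Needs java.io.*, java.nio.charset.StandardCharsets, javax.xml.stream.*, javax.xml.transform.Source, javax.xml.transform.stream.StreamSource
final Source source = new StreamSource(new InputStreamReader(new FileInputStream(new File("my.xml")), StandardCharsets.UTF_8));
final XMLStreamReader xmlReader = XMLInputFactory.newFactory().createXMLStreamReader(source);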



Answer 2:


XMLInputFactory.createXMLStreamReader(java.io.InputStream) reads from a byte stream, so the offset is a byte offset.

XMLInputFactory.createXMLStreamReader(java.io.Reader) reads from a character stream, so the offset is a character offset.
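A small sketch for checking this on your own parser: it feeds the same in-memory document through both overloads and records the offset reported at each start tag. The class `OffsetDemo` and its method names are hypothetical. The actual numbers depend on the StAX implementation in use, so no expected output is given here.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class for illustration only.
final class OffsetDemo {

    // Collects "elementName@offset" for every start tag the reader reports.
    static List<String> startTagOffsets(XMLStreamReader reader) throws XMLStreamException {
        List<String> result = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                int offset = reader.getLocation().getCharacterOffset();
                result.add(reader.getLocalName() + "@" + offset);
            }
        }
        reader.close();
        return result;
    }

    // Parses the document through the InputStream overload (byte stream input).
    static List<String> fromBytes(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return startTagOffsets(XMLInputFactory.newFactory()
                    .createXMLStreamReader(new ByteArrayInputStream(bytes)));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the document through the Reader overload (character stream input).
    static List<String> fromChars(String xml) {
        try {
            return startTagOffsets(XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(xml)));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is a single char but two bytes in UTF-8.
        String xml = "<r>\u00e9<a/></r>";
        System.out.println("InputStream: " + fromBytes(xml));
        System.out.println("Reader:      " + fromChars(xml));
    }
}
```

If the two printed lists differ for the element that follows the accented character, the parser is counting in the units of whichever input it was given.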



Source: https://stackoverflow.com/questions/15974196/xmlstreamreader-get-character-offset-xml-from-file
