Java XMLReader not clearing multi-byte UTF-8 encoded attributes

Submitted by 老子叫甜甜 on 2019-12-23 10:06:50

Question


I've got a really strange situation where my SAX ContentHandler is being handed bad Attributes by XMLReader. The document being parsed is UTF-8 with multi-byte characters inside XML attributes. What appears to happen is that these attribute values accumulate each time my handler is called: instead of receiving each value on its own, the handler receives it concatenated onto the previous node's value.

Here is an example which demonstrates this using public data (Wikipedia).

public class MyContentHandler extends org.xml.sax.helpers.DefaultHandler {

    public static void main(String[] args) {
        try {
            org.xml.sax.XMLReader reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader();
            reader.setContentHandler(new MyContentHandler());
            reader.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=allpages&apfilterredir=redirects&apdir=descending");

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    // Called once per start tag; prints the title attribute of each <p> element.
    @Override
    public void startElement(String uri, String localName, String qName, org.xml.sax.Attributes attributes) {
        if ("p".equals(qName)) {
            String title = attributes.getValue("title");
            System.out.println(title);
        }
    }
}

Update: This complete example produces (apologies to any Cantonese speakers for the vulgar output):

𩧢
𩧢𨳒
𩧢𨳒🛅
𩧢𨳒🛅🛄
𩧢𨳒🛅🛄🛃
𩧢𨳒🛅🛄🛃🛂
𩧢𨳒🛅🛄🛃🛂🛁
𩧢𨳒🛅🛄🛃🛂🛁🛀
𩧢𨳒🛅🛄🛃🛂🛁🛀🚿
𩧢𨳒🛅🛄🛃🛂🛁🛀🚿🚾

Does anyone have any clue what is happening and how to fix it? The titles in the returned document don't match what I see when I debug through this snippet.
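
The symptom can also be checked without a network call. The following is a minimal sketch, not the original poster's code: it parses a small in-memory UTF-8 document whose `title` attributes hold the same kind of characters as the last lines of the output above (emoji outside the Basic Multilingual Plane), then tests whether each value merely extends the previous one. The class name `AttributeAccumulationCheck`, the helper names `collectTitles` and `looksAccumulated`, and the sample document are all hypothetical. It also uses `SAXParserFactory` instead of the question's `XMLReaderFactory`, which is deprecated in newer JDKs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeAccumulationCheck {

    /** Parses UTF-8 XML and returns the title attribute of every <p> element, in document order. */
    public static List<String> collectTitles(String xml) {
        final List<String> titles = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    if ("p".equals(qName)) {
                        titles.add(attributes.getValue("title"));
                    }
                }
            });
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            reader.parse(new InputSource(new ByteArrayInputStream(bytes)));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return titles;
    }

    /** True if every value after the first is the previous value plus at least one more character. */
    public static boolean looksAccumulated(List<String> values) {
        if (values.size() < 2) {
            return false;
        }
        for (int i = 1; i < values.size(); i++) {
            String previous = values.get(i - 1);
            String current = values.get(i);
            if (current.length() <= previous.length() || !current.startsWith(previous)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Build the supplementary characters from code points to avoid source-encoding issues.
        String a = new String(Character.toChars(0x1F6C5));
        String b = new String(Character.toChars(0x1F6C4));
        String c = new String(Character.toChars(0x1F6C3));
        String xml = "<allpages><p title=\"" + a + "\"/><p title=\"" + b
                + "\"/><p title=\"" + c + "\"/></allpages>";

        List<String> titles = collectTitles(xml);
        for (String title : titles) {
            System.out.println(title);
        }
        System.out.println("Looks accumulated: " + looksAccumulated(titles));
    }
}
```

If the parser behaves correctly, each title should print on its own line and the final check should report `false`; a parser showing the behavior described above would be expected to report `true`.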


Answer 1:


This seems to be a bug in the version of Xerces bundled with the JRE (com.sun.org.apache.xerces.internal.parsers.SAXParser). Below are my notes.

The version bundled with JRE 1.6.0_24, as well as v2.4.0, v2.5.0, and v2.6.0, does accumulate Attributes.

Xerces-J v1.4.4 does not appear to have the bug.

Xerces2-J v2.6.1, v2.6.2, v2.9.0, and v2.11.0 do not appear to have the bug.

You can tell from the versions tested that I was bisecting the version history. The bug appears to have been fixed between v2.6.0 and v2.6.1. I'm kind of surprised the JRE hasn't been updated, as this was fixed in mainline Xerces about 7 years ago!
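
Given the bisection above, one possible workaround is to put a standalone Xerces2-J release that does not show the bug (v2.6.1 or later, per the notes) on the classpath and ask for its parser explicitly. The sketch below assumes the standalone parser class is `org.apache.xerces.parsers.SAXParser` from `xercesImpl.jar`; the class name `ParserChooser` and the method `newReader` are hypothetical. If the standalone class is not found, it falls back to whatever parser the JRE provides, so it runs either way.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class ParserChooser {

    // Parser class shipped in the standalone Apache Xerces2-J distribution (assumed to be on the classpath).
    static final String STANDALONE_XERCES = "org.apache.xerces.parsers.SAXParser";

    /** Returns the standalone Xerces parser if available, otherwise the JRE's default SAX parser. */
    public static XMLReader newReader() {
        try {
            Object parser = Class.forName(STANDALONE_XERCES).getDeclaredConstructor().newInstance();
            return (XMLReader) parser;
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Standalone Xerces is not usable here; fall through to the JRE default.
        }
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        // The class name shows which implementation was actually picked up.
        System.out.println("Using parser: " + reader.getClass().getName());
    }
}
```

A class name starting with `com.sun.org.apache.xerces.internal` in the printed line indicates the JRE-bundled copy is still in use; a name starting with `org.apache.xerces` indicates the standalone jar was picked up.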



Source: https://stackoverflow.com/questions/5626978/java-xmlreader-not-clearing-multi-byte-utf-8-encoded-attributes
