I would like to unmarshal some nasty HTML into a Java object using JAXB (I'm on Java 7).
TagSoup is a SAX-compliant parser that can handle nasty HTML.
How can I set up JAXB to use TagSoup for unmarshalling HTML?
I tried setting the system property with System.setProperty("org.xml.sax.driver", "org.ccil.cowan.tagsoup.Parser");
An XMLReader that I create myself then uses TagSoup, but JAXB still does not.
Does com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl use DOM or SAX for parsing XML?
How can I tell JAXB to use SAX?
How can I tell JAXB to use TagSoup as its SAX implementation?
Following Blaise's suggestion, I tried the code below, but I get a SAXParseException on the last line. The parse succeeds when done with the XMLReader alone:
JAXBContext jaxbContext = JAXBContext.newInstance(Thing.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
XMLReader xmlReader = new org.ccil.cowan.tagsoup.Parser();
xmlReader.parse("file:///c:/test.xml");
System.out.println("parse ok");
xmlReader.setContentHandler(unmarshaller.getUnmarshallerHandler());
//SAXParseException; systemId: file:/c:/test.xml; lineNumber: 5; columnNumber: 3; The element type "br" must be terminated by the matching end-tag "</br>".
Thing thing = (Thing) unmarshaller.unmarshal(new File("c:/test.xml"));
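You can get an UnmarshallerHandler from an Unmarshaller and set it as the ContentHandler on your SAX parser. After the SAX parse completes, obtain the object from the UnmarshallerHandler. Calling unmarshaller.unmarshal(new File(...)) instead makes JAXB parse the file again with its own default parser, which is what raises the SAXParseException.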
UnmarshallerHandler unmarshallerHandler = unmarshaller.getUnmarshallerHandler();
xmlReader.setContentHandler(unmarshallerHandler);
xmlReader.parse(...);
Thing thing = (Thing) unmarshallerHandler.getResult();
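The order of the calls above matters: the reader pushes SAX events to whichever ContentHandler is registered when parse() runs, and the result is read back from that handler afterwards. Here is a minimal sketch of that flow which runs on a plain JDK. It rests on two assumptions: the JDK's built-in SAX parser stands in for new org.ccil.cowan.tagsoup.Parser(), and a hand-written handler (the hypothetical ElementNameCollector) stands in for JAXB's UnmarshallerHandler. The class and method names are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerOrderSketch {

    // Hypothetical stand-in for JAXB's UnmarshallerHandler: it builds a result
    // from SAX events and exposes it through getResult() after the parse.
    static class ElementNameCollector extends DefaultHandler {
        private final List<String> names = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            names.add(qName); // record every element the reader reports
        }

        List<String> getResult() {
            return names;
        }
    }

    // Parses the given markup and returns the element names in document order.
    // Checked exceptions are wrapped so callers stay simple.
    static List<String> collectElementNames(String xml) {
        try {
            // Assumption: the JDK parser is used here so the sketch needs no
            // extra jars; with TagSoup this would be new org.ccil.cowan.tagsoup.Parser().
            XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ElementNameCollector handler = new ElementNameCollector();

            xmlReader.setContentHandler(handler);                     // 1. register the handler
            xmlReader.parse(new InputSource(new StringReader(xml)));  // 2. then parse
            return handler.getResult();                               // 3. read the result from the handler
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collectElementNames("<thing><name>x</name><br/></thing>"));
        // prints [thing, name, br]
    }
}
```

Because the stand-in parser is a strict XML parser, it rejects an unclosed <br> just like the error quoted in the question. Using the TagSoup reader together with unmarshaller.getUnmarshallerHandler() in the same three steps is what lets JAXB consume the HTML.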
There is an example of this on my blog.
Source: https://stackoverflow.com/questions/24791422/how-to-use-jaxb-with-html