Question
I want to read XHTML files using SAX or StAX, whatever works best. But I don't want entities to be resolved, replaced or anything like that. Ideally they should just remain as they are. I don't want to use DTDs.
Here's an (executable, using Scala 2.8.x) example:
import javax.xml.stream._
import javax.xml.stream.events._
import java.io._

println("StAX Test - "+args(0)+"\n")
val factory = XMLInputFactory.newInstance
// No DTD processing, and ask the parser to report entity references instead of replacing them
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false)
println("------")
val xer = factory.createXMLEventReader(new FileReader(args(0)))
// Names of all entity references the parser reports as separate events
val entities = new collection.mutable.ArrayBuffer[String]
while (xer.hasNext) {
  val event = xer.nextEvent
  if (event.isCharacters) {
    // Character data is echoed unchanged
    print(event.asCharacters.getData)
  } else if (event.getEventType == XMLStreamConstants.ENTITY_REFERENCE) {
    entities += event.asInstanceOf[EntityReference].getName
  }
}
println("------")
println("Entities: " + entities.mkString(", "))
Given the following XHTML file ...
<html>
  <head>
    <title>StAX Test</title>
  </head>
  <body>
    <h1>Hallo StAX</h1>
    <p id="html">
      &lt;div class=&quot;header&quot;&gt;
    </p>
    <p id="stuff">
      &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
    </p>
    Das war's!
  </body>
</html>
... running scala stax-test.scala stax-test.xhtml
will result in:
StAX Test - stax-test.xhtml
------
StAX Test
Hallo StAX
<div class="header">
berdies sollte das hier auch als Copyright sichtbar sein: ?
Das war's!
------
Entities: Uuml
So all entities have been replaced, more or less successfully. What I expected, and what I want, is this:
StAX Test - stax-test.xhtml
------
StAX Test
Hallo StAX
&lt;div class=&quot;header&quot;&gt;
&Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
Das war's!
------
Entities: // well, or no entities above and instead:
// Entities: lt, quot, quot, gt, Uuml, #169
Is this even possible? I want to parse XHTML, do some modifications and then output it like that as XHTML again. So I really want the entities to remain in the result.
Also I don't get why Uuml is reported as an EntityReference event while the rest aren't.
Answer 1:
A bit of terminology: &#x169; is a numeric character reference (not an entity), and &auml; is an entity reference (not an entity).
I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
As for entity references, low-level parse interfaces such as SAX will report the existence of the entity reference. At any rate, SAX reports them when they occur in element content, but not in attribute content. These are special events (startEntity and endEntity) notified only to the LexicalHandler rather than to the ContentHandler.
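To make the LexicalHandler route concrete, here is a minimal Java sketch. The class name EntityBoundaryDemo and the method entityNames are made up for illustration. It assumes the JDK's bundled SAX parser, and it declares Uuml in an internal DTD subset, because a reference to an undeclared entity in a document without any DTD is normally rejected as not well-formed. That requirement is at odds with the "no DTDs" wish in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: collects the names of general entities whose
// boundaries the parser reports through LexicalHandler.startEntity.
final class EntityBoundaryDemo {
    static List<String> entityNames(String xml) {
        List<String> names = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                // "[dtd]" marks the external DTD subset and "%..." a parameter entity; skip both.
                if (!name.startsWith("[") && !name.startsWith("%")) {
                    names.add(name);
                }
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Standard SAX2 property id for registering a LexicalHandler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumption: the entity is declared inline so the parser accepts the reference.
        String xml = "<!DOCTYPE p [<!ENTITY Uuml \"&#220;\">]>"
                + "<p>&Uuml;berdies &lt;b&gt; &#169;</p>";
        System.out.println("Entities: " + entityNames(xml));
    }
}
```

The handler only learns where an entity's replacement text starts and ends. The text itself still arrives expanded through characters(), and the character reference &#169; is not expected to be reported at all, in line with the answer above.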
Answer 2:
The answer to "why Uuml is reported as an EntityReference event while the rest aren't" is that the rest are built into XML itself (the predefined entities lt, gt and quot, plus numeric character references), while &Uuml; is specific to HTML 4.0.
Since your goal is to write modified XHTML, it may be possible to force the serializer to emit numeric entity references by setting the "encoding" to "US-ASCII" and/or the "method" to "html". The XSLT spec (which underlies Java XML serializers) says that the serializer "may output a character using a character entity reference" when the method is html. Setting the encoding to ASCII may force it to use numeric entities if named entities aren't supported.
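The encoding suggestion can be tried with the standard javax.xml.transform API. The sketch below is only an illustration. The class name AsciiSerializerDemo and the method toAscii are made up, and it relies on the serializer escaping characters that the target encoding cannot represent, which is what the XSLT output rules call for. It uses the "xml" method with an identity transform, so the named entities from the original file are not restored. Non-ASCII characters come back as numeric character references.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: re-serializes XML as US-ASCII via an identity transform.
final class AsciiSerializerDemo {
    static String toAscii(String xml) {
        try {
            // newTransformer() without a stylesheet copies the input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // \u00DC is U-umlaut and \u00A9 is the copyright sign.
        System.out.println(toAscii("<p>\u00DCberdies \u00A9</p>"));
    }
}
```

The result is plain ASCII that an XML parser reads back to the same characters. Whether a given serializer emits named references with the "html" method is implementation-dependent, as the answer notes.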
Answer 3:
In Java I would use a regular expression.
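import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityScan {
    public static void main(String... args) throws IOException {
        // Matches anything of the form &...; and captures the part between & and ;
        Pattern entity = Pattern.compile("&([^;]+);");
        Set<String> entities = new LinkedHashSet<String>();
        try (BufferedReader buf = new BufferedReader(new FileReader(args[0]))) {
            for (String line; (line = buf.readLine()) != null; ) {
                Matcher m = entity.matcher(line);
                while (m.find())
                    entities.add(m.group(1));
            }
        }
        System.out.println("Entities: " + entities);
    }
}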
prints
Entities: [lt, quot, gt, Uuml, #169]
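The pattern &([^;]+); is permissive: in text such as "a & b; c" it would also capture " b". A stricter variant, sketched below with made-up names (ReferenceScanner, scan, isNumeric), accepts only things shaped like a named entity reference or a numeric character reference, and keeps duplicates in document order. It is still a plain text scan, so it does not know about comments or CDATA sections, where the same character sequences are not references.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds &name;, &#123; and &#x7B; style references in raw text.
final class ReferenceScanner {
    // Group 1 is the part between '&' and ';'.
    // Names are limited to ASCII name characters, a simplification of the XML Name rule.
    private static final Pattern REFERENCE = Pattern.compile(
            "&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z_:][A-Za-z0-9_.:-]*);");

    // Returns every reference in order of appearance, duplicates included.
    static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = REFERENCE.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Numeric character references start with '#'; everything else is a named entity reference.
    static boolean isNumeric(String reference) {
        return reference.startsWith("#");
    }

    public static void main(String[] args) {
        // prints [Uuml, #169]
        System.out.println(scan("&Uuml;berdies &#169; a & b; c"));
    }
}
```

Splitting the result with isNumeric mirrors the terminology from Answer 1: Uuml is an entity reference, while #169 is a numeric character reference.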
Source: https://stackoverflow.com/questions/7385914/java-read-xml-and-leave-all-entities-alone