Question
I am trying to parse the content of an HTML table and write it to CSV.
I am using a StAX parser.
The HTML contains escaped characters such as &nbsp; and &amp;
I am using org.apache.commons.lang3.StringEscapeUtils to unescape the HTML line by line and write it to a new file.
StAX still fails to parse the unescaped characters.
Please help me fix or handle this exception.
I test with the XML fragment below:
<root><element>A B </element></root>
I call the code below to unescape the HTML:
StringEscapeUtils.unescapeHtml4(escapedHtml)
and write the result to a file.
I then try to parse that file using the StAX parser. The method that unescapes the file is:
public void unescapeHtmlFile(String filePath) throws IOException{
    BufferedReader fileReader = null;
    BufferedWriter fileWriter = null;
    try{
        fileReader = new BufferedReader(new FileReader(filePath));
        fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));
        String line = null;
        String unescapedLine = null;
        while((line=fileReader.readLine())!=null){
            System.out.println("Before: " + line);
            unescapedLine = StringEscapeUtils.unescapeHtml4(line);
            System.out.println("After: " + unescapedLine);
            fileWriter.newLine();
            fileWriter.write(unescapedLine);
        }
    }finally{
        fileReader.close();
        fileWriter.close();
    }
}
The output is below:
Document started
<?xml version="null" encoding='UTF-8' standalone='no'?>
Element started
<root>
Element started
<element0>
Characters
0123456 7890 ABC DEF
Element ended
</element0>
Element started
<element1>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
at parser.StreamParserTest.main(StreamParserTest.java:30)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: XML document structures must start and end within the same entity.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
at parser.StreamParserTest.main(StreamParserTest.java:30)
It fails to parse the unescaped value of
Please help.
Answer 1:
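The classes FileReader and FileWriter are old utility classes that unfortunately use the platform's default encoding (at least before Java 18, which made UTF-8 the default). On Windows that default is almost certainly not UTF-8. XML, however, is generally in UTF-8 (which can indeed represent all characters). An unescaped character such as the non-breaking space is then written as a single byte that is not valid UTF-8, which is what the parser reports.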
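To see the bytes involved, here is a small sketch. It assumes the offending character is the non-breaking space U+00A0 (the character that &nbsp; stands for) and uses ISO-8859-1 as a stand-in for a single-byte Windows default encoding; the class name NbspBytesDemo and the hex helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
final class NbspBytesDemo {

    // Formats bytes as space-separated uppercase hex, e.g. "C2 A0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // assumption: the character behind &nbsp;

        // Single-byte encoding: one byte, 0xA0.
        byte[] latin1 = nbsp.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hex(latin1)); // prints A0

        // UTF-8: two bytes, 0xC2 0xA0.
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8))); // prints C2 A0

        // A lone 0xA0 is not a valid UTF-8 sequence; the JDK decoder
        // substitutes the replacement character U+FFFD for it.
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD")); // prints true
    }
}
```

A strict XML parser does not substitute a replacement character; it rejects the byte, which matches the "Invalid byte 1 of 1-byte UTF-8 sequence" message in the question.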
So in the question's code, the lines
fileReader = new BufferedReader(new FileReader(filePath));
fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));
should be
fileReader = new BufferedReader(new InputStreamReader(
new FileInputStream(filePath), StandardCharsets.UTF_8));
fileWriter = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("./out/UnescapedHtml.html"),
StandardCharsets.UTF_8));
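Put together, the whole method could look roughly like the sketch below. It is not the asker's code: the class name Utf8Unescaper, the method name unescapeFile and the idea of passing the unescape step as a function are assumptions made so the example compiles without Apache Commons on the classpath. It also uses try-with-resources, so neither stream can trigger a NullPointerException in a finally block.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

// Hypothetical helper class; names are illustrative only.
final class Utf8Unescaper {

    // Reads 'in' as UTF-8, applies 'unescape' to every line,
    // and writes the result to 'out' as UTF-8.
    static void unescapeFile(Path in, Path out, UnaryOperator<String> unescape)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(unescape.apply(line));
                writer.newLine(); // terminate each line after writing it
            }
        }
    }
}
```

With commons-lang3 available, the call would presumably be Utf8Unescaper.unescapeFile(Paths.get(filePath), Paths.get("./out/UnescapedHtml.html"), StringEscapeUtils::unescapeHtml4).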
Strictly speaking, one should first read the <?xml ...?> declaration and check whether it has an encoding attribute naming the charset; the default is UTF-8. That first read could be done with StandardCharsets.ISO_8859_1, because decoding as UTF-8 stumbles over invalid multi-byte sequences, whereas ISO-8859-1 accepts every byte.
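A minimal sketch of that idea follows. The class XmlEncodingSniffer, its detect method and the regular expression are assumptions for illustration; it ignores byte-order marks and UTF-16 input, which a real XML parser handles itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a library API.
final class XmlEncodingSniffer {

    // Matches encoding="..." or encoding='...' inside a leading <?xml ...?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // 'head' holds the first bytes of the file (the first hundred or so suffice).
    static Charset detect(byte[] head) {
        // ISO-8859-1 maps every byte to a character, so this decode never fails.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall back to the default.
            }
        }
        return StandardCharsets.UTF_8; // XML default when no encoding is declared
    }
}
```

The returned Charset can then be passed to the InputStreamReader in place of the hard-coded StandardCharsets.UTF_8.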
Using StandardCharsets instead of the String "UTF-8" does away with
- an UnsupportedEncodingException to handle,
- a magic constant.
The charsets in StandardCharsets are guaranteed to be supported on every Java platform implementation.
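The difference shows up even in a one-line conversion. In this sketch (class and method names are hypothetical), both methods produce the same bytes, but only the String-named variant has to deal with a checked exception.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for comparison only.
final class CharsetNameDemo {

    // Charset given by name: the compiler forces us to handle
    // UnsupportedEncodingException, and "UTF-8" is a magic constant.
    static byte[] encodeByName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Charset given as a constant: no checked exception, no string to mistype.
    static byte[] encodeByConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```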
Source: https://stackoverflow.com/questions/21552315/characters-generated-by-apache-commons-stringescapeutils-unescapehtml-cannnot-be