Question
I have an HTML file that starts with these lines:
<?xml version="1.0" encoding="windows-1252"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="fi" lang="fi" xmlns="http://www.w3.org/1999/xhtml">
<head>
An exception occurs when I try to parse the HTML file with a SAXParser. It reports that a character at a specific line and column is invalid. This is the code I use:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
InputSource is = new InputSource(new FileInputStream(file));
parser.parse(is, this);
If I specify the encoding explicitly with is.setEncoding("ISO-8859-1"); the exception does not occur.
Why do I have to tell the SAXParser explicitly which encoding to use? Can't it detect the encoding from the byte stream or from the XML declaration at the top of the HTML file?
Also, the docs say:
"If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification"
But this is not true! Looking further into the Java code, I see this:
/*
 * TODO: Let Expat try to guess the encoding instead of defaulting.
 * Unfortunately, I don't know how to tell which encoding Expat picked,
 * so I won't know how to encode "<externalEntity>" below. The solution
 * I think is to fix Expat to not require the "<externalEntity>"
 * workaround.
 */
this.encoding = encoding == null ? DEFAULT_ENCODING : encoding;
this.pointer = initialize(
    this.encoding,
    processNamespaces
);
Is there no algorithm for detecting the XML encoding?
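For reference, the kind of autodetection the quoted documentation alludes to can be approximated by hand. What follows is only a minimal sketch under stated assumptions, not what any particular parser does: the class name XmlEncodingSniffer and its detect method are hypothetical, it recognises only UTF-8 and UTF-16 byte order marks plus ASCII-compatible encodings named in the XML declaration, it ignores the UTF-32 and EBCDIC cases, and it falls back to UTF-8 when nothing is declared.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: guesses the encoding of an XML document from its first bytes.
public class XmlEncodingSniffer {
    // Matches the encoding pseudo-attribute of an XML declaration at the very start.
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    // 'head' should hold the first bytes of the file (about a hundred is plenty).
    public static String detect(byte[] head) {
        // Step 1: a byte order mark, if present, decides the encoding family.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";

        // Step 2: assume an ASCII-compatible encoding, so the declaration itself
        // can be read byte for byte as ISO-8859-1 without losing information.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(text);

        // Step 3: use the declared name, or the XML default when none is declared.
        return m.find() ? m.group(2) : "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

With the file from the question, the detected name could then be passed to is.setEncoding(...) before calling parser.parse(is, this), instead of hard-coding "ISO-8859-1".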
Source: https://stackoverflow.com/questions/54711709/does-the-saxparser-detect-xml-encoding