Question
I'm using JTidy v. r938. I'm using this code to attempt to clean up a page …
final Tidy tidy = new Tidy();
tidy.setQuiet(false);
tidy.setShowWarnings(true);
tidy.setShowErrors(0);
tidy.setMakeClean(true);
Document document = tidy.parseDOM(conn.getInputStream(), null);
But when I parse this URL -- http://www.chicagoreader.com/chicago/EventSearch?narrowByDate=This+Week&eventCategory=93922&keywords=&page=1, things aren't getting cleaned up. For example, the META tags on the page, like
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
remain as
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
instead of having a "</META>" tag or appearing as "<META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>". I confirm this by outputting the resulting JTidy org.w3c.dom.Document as a String.
What can I do to make JTidy truly clean up the page, i.e. make it well-formed? I realize there are other tools out there, but this question specifically relates to using JTidy.
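To make the well-formedness goal above concrete, here is a minimal sketch that uses only the JDK's standard XML parser, not JTidy. The class name `WellFormedCheck` and the method `isWellFormed` are hypothetical helpers introduced for illustration. The sketch assumes that "well-formed" means "parses as XML", and it resolves any external DTD to an empty stream, so entities defined only in a DTD (such as `&nbsp;`) would be reported as errors.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JTidy: reports whether a string parses as XML.
class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            // Resolve any external DTD to an empty stream so nothing is fetched over the network.
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (Exception e) {
            // A parse failure (or any other failure) is treated as "not well-formed".
            return false;
        }
    }

    public static void main(String[] args) {
        String attrs = "http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"";
        System.out.println(isWellFormed("<META " + attrs + ">"));        // unclosed tag: false
        System.out.println(isWellFormed("<META " + attrs + "/>"));       // self-closed: true
        System.out.println(isWellFormed("<META " + attrs + "></META>")); // explicit close: true
    }
}
```

Running the string produced by JTidy through a check like this is one way to tell whether a given combination of flags actually yields XML-parseable output.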
Answer 1:
You need to set several flags on Tidy if you want XML output:
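import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE); // effectively disable line wrapping
    tidy.setPrintBodyOnly(true);        // emit only the contents of <body>
    tidy.setXmlOut(true);               // write the output as XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // parseDOM writes the cleaned markup to outputStream as a side effect
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}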
Or, if you simply want XHTML output:
Tidy tidy = new Tidy();
tidy.setXHTML(true);
Answer 2:
Use tidy.setXmlTags(true); to make JTidy parse the input as XML instead of HTML.
Answer 3:
Use tidy.setForceOutput(true) (at your own risk) to generate output even if errors are found.
Answer 4:
I parse the HTML twice to get well-formed XML:
BufferedReader br = new BufferedReader(new StringReader(str));
StringWriter sw = new StringWriter();
Tidy t = new Tidy();
t.setDropEmptyParas(true);
t.setShowWarnings(false); // hide warnings
t.setQuiet(true); // suppress the summary and other non-essential output
t.setUpperCaseAttrs(false);
t.setUpperCaseTags(false);
t.parse(br,sw);
StringBuffer sb = sw.getBuffer();
String strClean = sb.toString();
br.close();
sw.close();
// second pass: re-parse the cleaned output, this time treating it as XML
br = new BufferedReader(new StringReader(strClean));
sw = new StringWriter();
t = new Tidy();
t.setXmlTags(true);
t.parse(br,sw);
sb = sw.getBuffer();
String strClean2 = sb.toString();
br.close();
sw.close();
Source: https://stackoverflow.com/questions/10390922/how-do-i-make-jtidy-make-html-documents-well-formed