Question
(This is a follow-up to a problem I had a few days ago, where JTidy was reporting 3 errors inside a 300k HTML document, but not reporting where. After some grinding on the problem, I found what appears to be causing the error, and I have a strong suspicion of why, but I haven't decided what to do about it yet.)
Here is a small standalone HTML expression that causes JTidy to report an error:
<html>
<body>
Some text.
<script type="text/javascript">
var foo = "Press <u>ESC</u> to continue";
</script>
</body>
</html>
The JavaScript string constant contains HTML tags, and these consistently throw JTidy off: remove the underline element and JTidy parses the document without errors. More precisely, JTidy's parser reports the error on the closing tag; the opening tag is fine (the output might be somewhat wrong, but it was sufficient for my later purposes). The error is reported even when the tags appear only in a JavaScript comment:
// Any closing tags here at all will <b>throw JTidy off</b>.
I think it's safe to say that the above is valid HTML; but I can't find any documentation on what to do about it. Searching around, I find that this has been fixed in tidy-html5; it only appears to be broken in JTidy, the Java port.
Searching a bit more, I find that I am using the latest JTidy, according to its SourceForge page; version r938 is the one in my Maven repo. (Actually, the source is unpacked in a sandbox, so that I could debug this problem.) The bug report I linked above is dated 2015; JTidy r938 came out in 2009.
Am I correct in believing JTidy is handling this incorrectly? If so, should I try to fix it, or has it been addressed in some private branch? I wouldn't call myself a parser / lexer expert, but I could muddle through if I had to.
Answer 1:
This is indeed a bug in JTidy. Sadly, I had already fixed it (and other problems) but didn't end up making a new release, because I didn't have time to work on JTidy anymore.
The code is available in Subversion. If you check out the latest revision from trunk and build it, your program should work.
I also made a branch called CodeUpdateAndJava5, in which I brought the code much closer to the behavior of the tidy tool (before they started working on the html5 version) and started adding more modern Java features. That code would work too, although I never published a release based on it.
Depending on what you need, the jsoup library might work better for you, and it's being maintained and updated.
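If building from trunk or switching libraries is not practical, a pre-processing step may be worth trying. HTML 4 treats the first "</" followed by a letter as the end of script content, which is why the usual advice is to write "<\/" inside inline JavaScript; the two forms mean the same thing inside a JavaScript string, and the change is harmless inside a comment. The sketch below rewrites "</" to "<\/" inside each script element before the document is handed to a parser. The class name ScriptEndTagEscaper is made up for illustration, the regex is a deliberately simple approximation of script element boundaries, and whether this actually silences the error has not been verified against JTidy r938.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy: escapes "</" inside <script> bodies.
final class ScriptEndTagEscaper {
    // One script element: opening tag, body (as short as possible), closing tag.
    private static final Pattern SCRIPT = Pattern.compile(
            "(<script\\b[^>]*>)(.*?)(</script\\s*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String escape(String html) {
        Matcher m = SCRIPT.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy everything between the previous script element and this one unchanged.
            out.append(html, last, m.start());
            out.append(m.group(1));
            // Only the script body is rewritten; the real </script> tag is kept as-is.
            out.append(m.group(2).replace("</", "<\\/"));
            out.append(m.group(3));
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body>Some text."
                + "<script type=\"text/javascript\">"
                + "var foo = \"Press <u>ESC</u> to continue\";"
                + "</script></body></html>";
        System.out.println(escape(html));
    }
}
```

The escaped string can then be passed to the parser in place of the original document. Note that the replacement is applied blindly to the whole script body, so it assumes "</" only ever occurs inside string literals, regular expressions, or comments, which is the case in the example from the question.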
Source: https://stackoverflow.com/questions/40849872/jtidy-cant-handle-html-tags-inside-script-element