Question
When I parse local HTML files, jsoup changes the quotes inside an anchor element's attribute to &quot;, which breaks my HTML.
Let's assume I want to change the value "one" to "two" in the following HTML fragment:
<div class="pg2-txt1">
<a class="foo" appareantly_a_javascript_statement='{"targetId":"pg1-magn1", "ordinal":1}'>one</a>
</div>
What I get is:
<div class="pg2-txt1">
<a class="foo" appareantly_a_javascript_statement="{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}">two</a>
</div>
The literal quotes inside the anchor element's attribute are needed. My code currently looks like this:
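import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
File input = new File("D:/javatest/page02.html");
Document doc = Jsoup.parse(input, "UTF-8");
Element div = doc.select("div.pg2-txt1").first(); // anchor element is only identifiable by its parent <div> class
div.child(0).text("two"); // the actual anchor element; replaces "one" with "two"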
I tried
doc.outputSettings().prettyPrint(false);
with no success.
Can I achieve this with jsoup? Or do I have to use a different parser, and if so, what would that look like?
Thank you very much in advance.
Answer 1:
According to the HTML spec, jsoup behaves correctly:
By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.
Note the last sentence!
Basically, that means the other software that needs literal double quotes in the appareantly_a_javascript_statement attribute is parsing its value incompletely: it reads the raw markup instead of decoding character references first.
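To illustrate the point, here is a minimal sketch (plain Java, no jsoup needed) of what a consumer would have to do to the raw attribute text to get the original value back. The class and method names are made up for this example, and the helper only handles the few references mentioned in the quoted spec text; a real HTML parser decodes all of them.

```java
class AttributeDecodeDemo {
    // Hypothetical helper: decodes only the character references discussed above.
    static String decodeQuoteReferences(String rawAttributeValue) {
        return rawAttributeValue
                .replace("&quot;", "\"")
                .replace("&#34;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&"); // done last, so "&amp;quot;" becomes "&quot;" and not a quote
    }

    public static void main(String[] args) {
        // The attribute value as jsoup serializes it.
        String serialized = "{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}";
        // Prints the same text the original single-quoted attribute contained.
        System.out.println(decodeQuoteReferences(serialized));
    }
}
```

After decoding, both forms of the attribute carry the same value, which is why jsoup is free to pick either one when it writes the document.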
I see two solutions:
1) Modify the function that interprets the appareantly_a_javascript_statement value.
I can't help you there, since I don't know where that is done.
2) Change the Jsoup output via regular expressions.
This is pretty hacky...
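String html = doc.outerHtml();
boolean changed;
// Step 1: switch attribute values that start with "{" from double to single quotes.
html = html.replaceAll("(=\"\\{)([^\"]+)(\")", "='{$2'");
// Step 2: inside single-quoted values, turn one &quot; per pass back into a literal
// double quote, and repeat until the length stops changing.
do {
    int oldLength = html.length();
    html = html.replaceAll("(=')([^']+)(&quot;)([^']+)(')", "$1$2\"$4$5");
    changed = html.length() != oldLength;
} while (changed);
System.out.print(html);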
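If the loop feels too fragile, the same post-processing could probably be done in a single pass with java.util.regex. This is only a sketch under assumptions: the class and method names are invented here, it works on the serialized HTML string (for example the result of doc.outerHtml()), and it only touches double-quoted attribute values that start with "{" and end with "}". Values containing a literal single quote are left alone, because they cannot be wrapped in single quotes safely.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteRestorer {
    // Double-quoted attribute values that look like a JSON object: ="{ ... }"
    private static final Pattern JSON_ATTRIBUTE = Pattern.compile("=\"(\\{[^\"]*\\})\"");

    // Hypothetical helper: rewrites ="{...&quot;...}" as ='{..."...}'.
    static String restoreJsonAttributeQuotes(String html) {
        Matcher matcher = JSON_ATTRIBUTE.matcher(html);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = matcher.group(1);
            String replacement = value.contains("'")
                    ? matcher.group() // a literal ' would end a single-quoted value early
                    : "='" + value.replace("&quot;", "\"") + "'";
            // quoteReplacement keeps any $ or \ in the value from being read as group references
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String serialized = "<a class=\"foo\" appareantly_a_javascript_statement="
                + "\"{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}\">two</a>";
        System.out.println(restoreJsonAttributeQuotes(serialized));
    }
}
```

Like the loop above, this is still string surgery on serialized HTML, so it is only as reliable as the assumption that the affected attribute values are JSON-like objects.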
Source: https://stackoverflow.com/questions/24145426/jsoup-stop-jsoup-from-making-quotes-into-amp