Question
I'm using jsoup to parse HTML documents and perform some analysis on them.
After parsing, is there any way to determine whether a given attribute was enclosed in double quotes, single quotes, or no quotes?
In other words, is there any way I could distinguish the following:
Document foo = Jsoup.parse("<html><body><a name=\"value\"></body></html>");
Document bar = Jsoup.parse("<html><body><a name='value'></body></html>");
Document baz = Jsoup.parse("<html><body><a name=value></body></html>");
Ideally, Attribute would have booleans isDoubleQuoted(), isSingleQuoted(), and isUnquoted(), or similar.
It appears that Jsoup simply discards that information during parsing, which is quite sad, because I need to know for my analysis.
But maybe I'm missing something? :)
Note that I can't simply use a regex on the original string. The documents I'm analysing can be arbitrarily complex, and any given attribute (i.e., key/value pair) may appear more than once within the document. Thus, it wouldn't work to simply "grep" for a key/value mapping (e.g., check whether the string that I parse with jsoup contains name=value or name='value' or name="value") to find out, although that is an approximation which, though unsatisfactory, I will probably have to live with until there is a better solution.
Answer 1:
Just in case anyone's interested: I've had a closer look into jsoup and confirmed that the information about how any particular attribute's value was quoted is discarded during parsing. It is (necessarily) available during parsing, of course, but it is basically thrown away and not stored in the resulting DOM tree.
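To illustrate why the quoting style is necessarily visible while a tag is being tokenized, here is a minimal standalone sketch. It is not jsoup code and not the code from the pull request; the class AttributeQuoteScanner, the enum QuoteStyle, and the method scanTag are hypothetical names made up for this example. It assumes it is handed a single start tag and ignores comments, scripts, and character references, so it is far simpler than a real HTML tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch; none of these names exist in jsoup.
class AttributeQuoteScanner {

    /** How an attribute value was written in the source markup. */
    enum QuoteStyle { DOUBLE, SINGLE, UNQUOTED, NO_VALUE }

    /** One attribute from the source text, plus the quoting style that was used. */
    static final class ScannedAttribute {
        final String name;
        final String value;
        final QuoteStyle style;

        ScannedAttribute(String name, String value, QuoteStyle style) {
            this.name = name;
            this.value = value;
            this.style = style;
        }
    }

    private static boolean isSpace(char c) {
        return Character.isWhitespace(c);
    }

    /** Scans one start tag such as {@code <a name='value'>} and lists its attributes. */
    static List<ScannedAttribute> scanTag(String tag) {
        List<ScannedAttribute> found = new ArrayList<>();
        int n = tag.length();
        int i = 0;
        if (i < n && tag.charAt(i) == '<') i++;
        // Skip the tag name.
        while (i < n && !isSpace(tag.charAt(i)) && "/>".indexOf(tag.charAt(i)) < 0) i++;

        while (true) {
            // Skip whitespace and any self-closing slash between attributes.
            while (i < n && (isSpace(tag.charAt(i)) || tag.charAt(i) == '/')) i++;
            if (i >= n || tag.charAt(i) == '>') break;

            // Read the attribute name up to whitespace, '=', '/' or '>'.
            int nameStart = i;
            while (i < n && !isSpace(tag.charAt(i)) && "=/>".indexOf(tag.charAt(i)) < 0) i++;
            String name = tag.substring(nameStart, i);

            while (i < n && isSpace(tag.charAt(i))) i++;
            if (i >= n || tag.charAt(i) != '=') {
                // Bare attribute such as "disabled": there is no value at all.
                found.add(new ScannedAttribute(name, "", QuoteStyle.NO_VALUE));
                continue;
            }
            i++; // consume '='
            while (i < n && isSpace(tag.charAt(i))) i++;
            if (i >= n) {
                found.add(new ScannedAttribute(name, "", QuoteStyle.UNQUOTED));
                break;
            }

            char c = tag.charAt(i);
            if (c == '"' || c == '\'') {
                // Quoted value: this is the point where the quote character is known.
                int end = tag.indexOf(c, i + 1);
                if (end < 0) end = n; // unterminated quote: take the rest of the input
                QuoteStyle style = (c == '"') ? QuoteStyle.DOUBLE : QuoteStyle.SINGLE;
                found.add(new ScannedAttribute(name, tag.substring(i + 1, end), style));
                i = Math.min(end + 1, n);
            } else {
                // Unquoted value: runs until whitespace or '>'.
                int valueStart = i;
                while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '>') i++;
                found.add(new ScannedAttribute(name, tag.substring(valueStart, i), QuoteStyle.UNQUOTED));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String[] tags = { "<a name=\"value\">", "<a name='value'>", "<a name=value>" };
        for (String tag : tags) {
            for (ScannedAttribute a : scanTag(tag)) {
                System.out.println(tag + "  " + a.name + " is " + a.style);
            }
        }
    }
}
```

The branch on the first character after the equals sign is where a tokenizer learns the quoting style. A DOM that keeps only the name and the value has no place to store it, which matches what I observed in jsoup.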
I created a pull request to add this missing functionality to jsoup: https://github.com/jhy/jsoup/pull/1114.
Not sure how good the chances are of getting a PR into jsoup. The project currently has 40 pending pull requests (including mine), the oldest of which dates back to fall 2011 (seven years ago). On the other hand, some PRs get merged quickly: the most recent merge was about two months ago, and that PR was merged mere days after it was submitted. Let's see. Until there is a stable version of jsoup with this functionality added, I can at least use my own fork.
Source: https://stackoverflow.com/questions/51950033/can-i-use-jsoup-to-determine-whether-an-html-attribute-is-enclosed-in-single-or