Can I use jsoup to determine whether an HTML attribute is enclosed in single or double quotes (or none)?

Posted by 百般思念 on 2019-12-11 02:58:51

Question


I'm using jsoup to parse HTML documents and perform some analysis on them.

After parsing, is there any way to determine whether a given attribute was enclosed in double quotes, single quotes, or no quotes?

In other words, is there any way I could distinguish the following:

Document foo = Jsoup.parse("<html><body><a name=\"value\"></body></html>");
Document bar = Jsoup.parse("<html><body><a name='value'></body></html>");
Document baz = Jsoup.parse("<html><body><a name=value></body></html>");

Ideally, Attribute would have booleans isDoubleQuoted(), isSingleQuoted(), and isUnquoted(), or similar.

It appears that Jsoup simply discards that information during parsing, which is quite sad, because I need to know for my analysis.

But maybe I'm missing something? :)

Note that I can't simply use a regex on the original string. The documents I'm analysing can be arbitrarily complex, and any given attribute (i.e., key/value pair) may appear more than once within a document. Thus, it wouldn't work to simply "grep" for a key/value mapping (e.g., check whether the string that I parse with jsoup contains name=value, name='value', or name="value") to find out. That said, it is an approximation which, though unsatisfactory, I will probably have to live with until there is a better solution.
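For reference, the "grep" approximation mentioned above could look roughly like the sketch below. It uses only java.util.regex, not jsoup; the class name QuoteStyleGrep and the method countQuoteStyles are made up for illustration. It counts how often a given key/value pair occurs in each quoting style in the raw source, and it has exactly the weakness described: it cannot tell which element a match belongs to, and it will also match inside comments, scripts, or text.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: a crude scan over the raw source.
class QuoteStyleGrep {

    /** Returns the counts {doubleQuoted, singleQuoted, unquoted} for key=value. */
    static int[] countQuoteStyles(String rawHtml, String key, String value) {
        String k = Pattern.quote(key);
        String v = Pattern.quote(value);
        // Group 1 matches only for "value", group 2 only for 'value'.
        // An unquoted value must be followed by whitespace or '>'.
        Pattern p = Pattern.compile(
                "(?<![\\w-])" + k + "\\s*=\\s*(?:(\")" + v + "\"|(')" + v + "'|" + v + "(?=[\\s>]))");
        int[] counts = new int[3];
        Matcher m = p.matcher(rawHtml);
        while (m.find()) {
            if (m.group(1) != null) {
                counts[0]++;
            } else if (m.group(2) != null) {
                counts[1]++;
            } else {
                counts[2]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String html = "<a name=\"value\"><b name='value'><i name=value><u name=\"value\">";
        System.out.println(Arrays.toString(countQuoteStyles(html, "name", "value")));
    }
}
```

The counts only say that a pair occurs in some style somewhere in the document, so they cannot be mapped back to a particular Attribute in the parsed tree.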


Answer 1:


Just in case anyone's interested: I've had a closer look at jsoup and confirmed that the information about how any particular attribute's value was quoted is discarded during parsing. It is (necessarily) available during parsing, of course, but it is thrown away rather than stored in the resulting DOM tree.
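To illustrate why the quoting style is necessarily known while parsing, here is a minimal, self-contained sketch of a scanner for a single start tag that records the style instead of dropping it. This is not jsoup's tokenizer and not the code from the pull request; the names QuoteAwareTagScanner, QuoteStyle, and scan are hypothetical, and the scanner ignores most of the HTML tokenization rules (character references, error recovery, and so on).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not jsoup code: scans one start tag such as <a name='value'>.
class QuoteAwareTagScanner {

    enum QuoteStyle { DOUBLE, SINGLE, UNQUOTED, NO_VALUE }

    static Map<String, QuoteStyle> scan(String tag) {
        Map<String, QuoteStyle> styles = new LinkedHashMap<>();
        int n = tag.length();
        int i = 1; // skip the leading '<'
        // Skip the tag name.
        while (i < n && !isSpace(tag, i) && tag.charAt(i) != '>') i++;
        while (i < n) {
            // Skip whitespace and a possible self-closing slash.
            while (i < n && (isSpace(tag, i) || tag.charAt(i) == '/')) i++;
            if (i >= n || tag.charAt(i) == '>') break;
            // Read the attribute name.
            int nameStart = i;
            while (i < n && !isSpace(tag, i) && "=>/".indexOf(tag.charAt(i)) < 0) i++;
            String name = tag.substring(nameStart, i);
            while (i < n && isSpace(tag, i)) i++;
            if (i >= n || tag.charAt(i) != '=') {
                styles.put(name, QuoteStyle.NO_VALUE); // e.g. <input disabled>
                continue;
            }
            i++; // consume '='
            while (i < n && isSpace(tag, i)) i++;
            if (i >= n) {
                styles.put(name, QuoteStyle.NO_VALUE);
                break;
            }
            char c = tag.charAt(i);
            if (c == '"' || c == '\'') {
                // The opening quote character is the information a DOM usually drops.
                styles.put(name, c == '"' ? QuoteStyle.DOUBLE : QuoteStyle.SINGLE);
                int close = tag.indexOf(c, i + 1);
                i = close < 0 ? n : close + 1;
            } else {
                styles.put(name, QuoteStyle.UNQUOTED);
                while (i < n && !isSpace(tag, i) && tag.charAt(i) != '>') i++;
            }
        }
        return styles;
    }

    private static boolean isSpace(String s, int i) {
        return Character.isWhitespace(s.charAt(i));
    }

    public static void main(String[] args) {
        System.out.println(scan("<a name=\"v\" id='x' class=y disabled>"));
    }
}
```

A parser that wanted to expose this would have to carry the recorded style from the tokenizer into its attribute objects, which is the kind of change the question asks for.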

I created a pull request to add this missing functionality to jsoup: https://github.com/jhy/jsoup/pull/1114.

I'm not sure how good the chances are of getting a PR into jsoup. The project currently has 40 pending pull requests (including mine), the oldest of which dates back to fall 2011 (seven years ago). On the other hand, some PRs get merged quickly: the most recent PR merge was about two months ago, and that PR was merged mere days after it was submitted. Let's see. Until there is a stable version of jsoup with this functionality, I can at least use my own fork.



Source: https://stackoverflow.com/questions/51950033/can-i-use-jsoup-to-determine-whether-an-html-attribute-is-enclosed-in-single-or
