Java : HTML Parsing

I have HTML content as given below. The items I am looking for are "img src" and "!important". Does Java provide any HTML par

4 Answers

青春惊慌失措

2020-12-22 05:29

Try NekoHTML. It is the HTML parsing library used by various higher-level testing frameworks such as HtmlUnit.

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
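
To give an idea of what "standard XML interfaces" means in practice, here is a minimal sketch that walks an `org.w3c.dom.Document` and collects the `src` of every `img` element. It is an illustration under stated assumptions, not NekoHTML's own code: the class name `ImgSrcWalker` and both method names are made up for this example, and the `Document` is built with the JDK's own XML parser as a stand-in, which only accepts well-formed markup. With NekoHTML on the classpath you would instead obtain the `Document` from its `org.cyberneko.html.parsers.DOMParser`. As far as I know NekoHTML reports element names in upper case by default, which is why the tag name is compared case-insensitively.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ImgSrcWalker {

    // Stand-in for NekoHTML: the JDK XML parser only accepts well-formed markup.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Collects the src attribute of every <img> element, in document order.
    static List<String> imgSources(Document doc) {
        List<String> sources = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            // Case-insensitive: HTML-oriented parsers may report "IMG" rather than "img".
            if ("img".equalsIgnoreCase(el.getTagName()) && el.hasAttribute("src")) {
                sources.add(el.getAttribute("src"));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        Document doc = parseXhtml(
                "<html><body><img src=\"a.png\"/><p><img src=\"b.gif\"/></p></body></html>");
        System.out.println(imgSources(doc)); // prints [a.png, b.gif]
    }
}
```

Only `parseXhtml` depends on which parser produced the tree. `imgSources` touches nothing but DOM interfaces, so it should work unchanged on a `Document` coming from an HTML-tolerant parser.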

闹比i

2020-12-22 05:36
I used jsoup. The library has a nice selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you can use code like this:
```
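import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); // base URI resolves relative links
Elements pngs = doc.select("img[src$=.png]"); // <img> elements whose src ends with .png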
```
隐瞒了意图╮

2020-12-22 05:44

I like using Jericho: http://jericho.htmlparser.net/docs/index.html

It copes with badly formed HTML, links that point to unavailable locations, and so on.

There are a lot of examples on their site; you just get all the IMG tags and inspect their attributes to extract those that meet your needs.


一向

2020-12-22 05:49

Document doc = Jsoup.parse(new File("d:\\1.html"), "UTF-8"); // parse the file once and reuse the Document
String value = doc.select("img").attr("src"); // src of the first <img> that has one
System.out.println(value); // http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
System.out.println(doc.select("span[style$=important;]").first().text()); // android se updates...

See also: jsoup, and the question "What are the pros and cons of the leading Java HTML parsers?"
