Reading HTML file to DOM tree using Java

Is there a parser/library that can read an HTML document into a DOM tree using Java? I'd like to use the standard DOM/XPath API that Java provides.

6 Answers

自闭症患者

2020-11-30 03:53
Use jsoup (https://jsoup.org). It is simple and powerful, and it can both read and modify HTML.

Sample:
```
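// Requires the jsoup library: org.jsoup.Jsoup, org.jsoup.nodes.*, org.jsoup.select.Elements
Document doc = Jsoup.parse(page);  // page is an HTML string; for a File, use Jsoup.parse(file, "UTF-8")
Element main = doc.getElementById("MainView");  // look up a single element by id
Elements links = doc.select(".link");           // CSS selector: all elements with class "link"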
```
To create elements, you can use j2html: https://j2html.com
抹茶落季

2020-11-30 03:59

TagSoup can do what you want.

被撕碎了的回忆

2020-11-30 04:00
Since HTML files are generally problematic, you'll need to first clean them up using a parser/scanner. I've used JTidy but never happily. NekoHTML works okay, but any of these tools are always just making a best guess of what is intended. You're effectively asking to let a program alter a document's markup until it conforms to a schema. That will likely cause structural (markup), style or content loss. It's unavoidable, and you won't really know what's missing unless you manually scan via a browser (and then you have to trust the browser too).

It really depends on your purpose — if you have thousands of ugly documents with tons of extraneous (non-HTML) markup, then a manual process is probably unreasonable. If your goal is accuracy on a few important documents, then manually fixing them is a reasonable proposition.

One approach is the manual process of repeatedly passing the source through a well-formed and/or validating parser, in an edit cycle using the error messages to eventually fix the broken markup. This does require some understanding of XML, but that's not a bad education to undertake.
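As a rough sketch of that edit cycle, the JDK's built-in XML parser can report where a document first stops being well-formed. The class name `WellFormednessCheck` and the method `firstError` are hypothetical names chosen for this illustration. The sketch only checks well-formedness, not validity against a schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of any library.
class WellFormednessCheck {

    /** Returns null if the markup is well-formed XML, otherwise "line:column: message". */
    public static String firstError(String markup) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays quiet otherwise,
            // so nothing is printed to stderr by the parser itself.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            // The parser stops at the first fatal error and says where it is.
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Typical HTML habit: <br> without a closing tag is not well-formed XML.
        System.out.println(firstError("<p>one<br>two</p>"));
        // The XHTML spelling parses cleanly, so this prints null.
        System.out.println(firstError("<p>one<br/>two</p>"));
    }
}
```

Each run reports only the first problem, so the cycle is: fix the reported location, run again, and repeat until the method returns null.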

As of Java 5, the necessary XML features, collectively called JAXP, are built into Java itself, so you don't need any external libraries.

You first obtain an instance of a DocumentBuilderFactory, set its features, create a DocumentBuilder (parser), then call its parse() method with an InputSource. InputSource has a number of possible constructors, with a StringReader used in the following example:
```
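import java.io.StringReader;
import javax.xml.parsers.*;
import org.xml.sax.InputSource;
// ...

// Inside a method that returns org.w3c.dom.Document; source holds the markup as a String.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);                         // no DTD validation
dbf.setNamespaceAware(true);
dbf.setIgnoringComments(false);                   // keep comment nodes in the tree
dbf.setIgnoringElementContentWhitespace(false);
dbf.setExpandEntityReferences(false);
DocumentBuilder db = dbf.newDocumentBuilder();
return db.parse(new InputSource(new StringReader(source)));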
```
This returns a DOM Document. If you don't mind using external libraries, there are also the JDOM and XOM APIs. These have some advantages over the SAX and DOM APIs in JAXP, but they require adding libraries that are not part of the JDK. The DOM can be somewhat cumbersome, but after so many years of using it I don't really mind any longer.
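Since the question asks about the standard DOM/XPath API, here is a minimal sketch of querying such a Document with `javax.xml.xpath`. It assumes the input is already well-formed (for example, cleaned-up XHTML) and declares no default namespace. The class `XhtmlXPathDemo` and the method `linkTargets` are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class for illustration; not part of any library.
class XhtmlXPathDemo {

    /** Parses well-formed markup and returns the href of every <a> that has one, in document order. */
    public static List<String> linkTargets(String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Assumption: no default namespace, so plain names like //a match.
            dbf.setNamespaceAware(false);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList anchors = (NodeList) xpath.evaluate("//a[@href]", doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (ParserConfigurationException | SAXException | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"https://jsoup.org\">jsoup</a>"
                + "<a name=\"top\">anchor without href</a>"
                + "</body></html>";
        System.out.println(linkTargets(page));
    }
}
```

If the document does declare the XHTML namespace, the parser would need to be namespace-aware and the XPath object given a `NamespaceContext`, which this sketch leaves out.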
遇见更好的自我

2020-11-30 04:00

Here is a link that might be useful: a list of Open Source HTML Parsers in Java.

深忆病人

2020-11-30 04:01

Use JTidy, either by converting the stream to XHTML and then re-parsing it with your favourite DOM implementation, or by calling parseDOM if the limited DOM implementation it gives you is enough.

Alternatively, use NekoHTML.

不要未来只要你来

2020-11-30 04:01

Apache's Xerces2 parser should do what you want.
