Is it possible to parse an HTML document and build a DOM tree (Java)?


Is it possible, and what tools could be used, to parse an HTML document (as a string or from a file) and then construct a DOM tree so that a developer can walk the tree throu

5 Answers

时光说笑

2021-01-07 08:02

HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
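
To illustrate the second half of that suggestion, here is a minimal sketch of building and walking a DOM tree with the JDK's standard XML API, assuming the HTML has already been converted to well-formed XML. The class name `DomWalk`, its helper methods, and the sample markup are made up for the example and are not part of any library mentioned here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalk {
    /** Parses well-formed XML/XHTML text into a W3C DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Collects element names in document order (depth-first, pre-order). */
    static void collectElementNames(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectElementNames(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input: already well-formed, as a converter would produce.
        String xhtml = "<html><body><p>Hello <b>DOM</b></p></body></html>";
        List<String> names = new ArrayList<>();
        collectElementNames(parse(xhtml).getDocumentElement(), names);
        System.out.println(names); // prints [html, body, p, b]
    }
}
```

The JDK parser rejects malformed markup, which is why a cleaning step such as the one described above has to run first on real-world HTML.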

离开以前

2021-01-07 08:03
You can use TagSoup: it is a SAX-compliant parser that can clean malformed content, such as HTML from generic web pages, into well-formed XML.
```
This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
```
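
Because TagSoup is driven through the SAX interface, one way to get from its events to a DOM tree is the JDK's identity `Transformer`, which copies a `SAXSource` into a `DOMResult`. The sketch below shows that plumbing using the JDK's own SAX reader as a stand-in, so it only accepts well-formed input. The class `SaxToDom` and method `toDom` are hypothetical names, and swapping in TagSoup's reader (`org.ccil.cowan.tagsoup.Parser`) is an assumption that requires the TagSoup jar on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDom {
    /** Runs the given SAX reader over the markup and collects its events into a DOM. */
    static Document toDom(XMLReader reader, String markup) throws Exception {
        DOMResult result = new DOMResult();
        // A Transformer created without a stylesheet performs an identity copy.
        TransformerFactory.newInstance().newTransformer()
                .transform(new SAXSource(reader, new InputSource(new StringReader(markup))), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; it requires well-formed input.
        // Assumption: with TagSoup available, its XMLReader would be passed here instead.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Document doc = toDom(reader, "<p>This is <b>bold</b> text</p>");
        System.out.println(doc.getDocumentElement().getTextContent()); // prints This is bold text
    }
}
```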
Happy的楠姐

2021-01-07 08:14
JTidy should let you do what you want.

Usage is fairly straightforward, and parsing is configurable, e.g.:
```
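import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.tidy.Tidy;

InputStream in = ...;
Tidy tidy = new Tidy();
// configure Tidy instance as required
...
// Parse the (possibly malformed) HTML into a W3C DOM; null means no tidied output stream is supplied
Document doc = tidy.parseDOM(in, null);
// Root element of the resulting tree, the starting point for walking it
Element root = doc.getDocumentElement();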
```
The JavaDoc is hosted here.
慢半拍i

2021-01-07 08:17

You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or invalid XML) file.

It is distributed under the Apache 2.0 license.

眼角桃花

2021-01-07 08:20

There are several open source tools to parse HTML from Java.

Check http://java-source.net/open-source/html-parsers

You can also check the answers to this question: Reading HTML file to DOM tree using Java. It is almost the same...
