Is it possible to parse an HTML document and build a DOM tree (Java)?

孤街浪徒 2021-01-07 07:54

Is it possible, and what tools could be used, to parse an HTML document as a string or from a file and then construct a DOM tree so that a developer can walk the tree throu

5 answers
  • 2021-01-07 08:02

    HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
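
    As a rough illustration of the "usual Java toolchain" step, here is a minimal sketch that assumes the HTML has already been converted to well-formed XML. The class `XmlToDom` and its `parse` method are hypothetical names made up for this example; only the JDK's own `javax.xml.parsers` classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library named in this thread:
// parses markup that is ALREADY well-formed XML into a W3C DOM Document.
final class XmlToDom {
    static Document parse(String wellFormedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(wellFormedXml)));
        } catch (Exception e) {
            // Covers ParserConfigurationException, SAXException (input is not
            // well-formed) and IOException.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }
}
```

    Raw HTML with unbalanced tags is rejected by this strict parser, which is why a cleaning step has to come first.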

  • 2021-01-07 08:03

    You can use TagSoup - it is a SAX-compliant parser that cleans up malformed content, such as the HTML found on typical web pages, into well-formed XML.

    This is <B>bold, <I>bold italic, </b>italic, </i>normal text
    
    gets correctly rewritten as:
    
    This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
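
    Because the parser is SAX-based, it emits events rather than a tree, so a DOM still has to be built from those events. One common way is an identity transform from a `SAXSource` into a `DOMResult`. The sketch below is only an illustration: `SaxToDom`, `toDom` and `jdkReader` are hypothetical names, and the JDK's strict SAX parser stands in for the reader. As far as I know TagSoup's `org.ccil.cowan.tagsoup.Parser` implements `XMLReader` and could be passed in its place, but treat that as an assumption and check the TagSoup documentation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: turns the SAX events of any XMLReader into a DOM tree.
final class SaxToDom {
    static Document toDom(XMLReader reader, String markup) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult();
            InputSource input = new InputSource(new StringReader(markup));
            identity.transform(new SAXSource(reader, input), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("SAX-to-DOM conversion failed", e);
        }
    }

    // Stand-in reader for this sketch: the JDK's own SAX parser, which
    // only accepts well-formed input.
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }
}
```

    With the stand-in reader, `SaxToDom.toDom(SaxToDom.jdkReader(), markup)` works for well-formed markup such as the rewritten line above wrapped in a single root element.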
    
  • 2021-01-07 08:14

    JTidy should let you do what you want.

    Usage is fairly straightforward, and parsing is configurable, e.g.:

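    import java.io.InputStream;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.tidy.Tidy;

    InputStream in = ...;
    Tidy tidy = new Tidy();
    // configure Tidy instance as required
    ...
    ...
    // parse the stream into a W3C DOM; null means no tidied output is written
    Document doc = tidy.parseDOM(in, null);
    Element root = doc.getDocumentElement();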
    

    The JavaDoc is hosted here.
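
    Once a root `Element` is available, walking the tree only needs the standard `org.w3c.dom` API. The following is a small sketch of a depth-first walk; `DomWalker` and `outline` are hypothetical names chosen for this example and are not part of JTidy.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: lists the element names of a DOM subtree,
// one per line, indented by two spaces per nesting level.
final class DomWalker {
    static List<String> outline(Element root) {
        List<String> lines = new ArrayList<>();
        walk(root, "", lines);
        return lines;
    }

    private static void walk(Node node, String indent, List<String> lines) {
        // Only elements are recorded; text, comments etc. are skipped.
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            lines.add(indent + node.getNodeName());
        }
        // Visit children in document order (depth-first).
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, indent + "  ", lines);
        }
    }
}
```

    Calling `DomWalker.outline(root)` with the `root` from the snippet above should return the nested element names of the parsed page.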

  • 2021-01-07 08:17

    You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML file (or invalid XML).

    It is distributed under the Apache 2.0 license.

  • 2021-01-07 08:20

    There are several open source tools to parse HTML from Java.

    Check http://java-source.net/open-source/html-parsers

    You can also check the answers to this question: "Reading HTML file to DOM tree using Java". It is almost the same.
