How can I parse an HTML string in Java?

再見小時候 2021-01-04 07:50

Given the string "Hello World!", what is the (easiest) way to get a DOM Element represent
6 Answers
  • 2021-01-04 08:00

    I found this somewhere (don't remember where):

     public static DocumentFragment parseXml(Document doc, String fragment)
     {
        // Wrap the fragment in an arbitrary element.
        fragment = "<fragment>"+fragment+"</fragment>";
        try
        {
            // Create a DOM builder and parse the fragment.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document d = factory.newDocumentBuilder().parse(
                    new InputSource(new StringReader(fragment)));
    
            // Import the nodes of the new document into doc so that they
            // will be compatible with doc.
            Node node = doc.importNode(d.getDocumentElement(), true);
    
            // Create the document fragment node to hold the new nodes.
            DocumentFragment docfrag = doc.createDocumentFragment();
    
            // Move the nodes into the fragment.
            while (node.hasChildNodes())
            {
                docfrag.appendChild(node.removeChild(node.getFirstChild()));
            }
            // Return the fragment.
            return docfrag;
        }
        catch (SAXException | ParserConfigurationException | IOException e)
        {
            // Parsing failed: the input is not well-formed XML, the parser
            // could not be configured, or reading the string failed.
            // Signal the failure to the caller by returning null.
            return null;
        }
    }
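
Since the method above relies on an XML parser, it only accepts markup that is well-formed XML (every tag closed, attributes quoted). Under that assumption, a minimal sketch of getting a DOM Element straight from a string could look like the following. The class name `DomFromString` and the method `parseRoot` are hypothetical names chosen for illustration, and wrapping the checked exceptions in an `IllegalArgumentException` is a design choice of this sketch, not part of the answer above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the names are illustrative only.
class DomFromString {

    // Parses a well-formed markup string and returns its root DOM element.
    // Input that is not well-formed XML is rejected with an IllegalArgumentException.
    static Element parseRoot(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse markup as XML", e);
        }
    }

    public static void main(String[] args) {
        Element table = parseRoot("<table><tr><td>Hello World!</td></tr></table>");
        System.out.println(table.getTagName());     // table
        System.out.println(table.getTextContent()); // Hello World!
    }
}
```

For real-world HTML that is not well-formed (unclosed `<br>` or `<p>` tags, for example), this approach fails, which is where the lenient HTML parsers mentioned in the other answers come in.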
    
  • 2021-01-04 08:02

    I've used Jericho HTML Parser. It's open source, tolerates badly formatted tags, and is lightweight.

  • 2021-01-04 08:03

    You could use HTML Parser, a Java library for parsing HTML in either a linear or nested fashion. It is an open-source tool and can be found on SourceForge.

  • 2021-01-04 08:12

    If you have a string that contains HTML, you can use the jsoup library like this to get HTML elements:

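    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    // Parse the HTML string into a jsoup Document.
    String htmlTable = "<table><tr><td>Hello World!</td></tr></table>";
    Document doc = Jsoup.parse(htmlTable);
    
    // Then use something like this to get your element:
    Elements tds = doc.getElementsByTag("td");
    
    // tds will contain this one element: <td>Hello World!</td>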
    

    Good luck!

  • 2021-01-04 08:14

    Here's a way:

    import java.io.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    
    public class HtmlParseDemo {
       public static void main(String [] args) throws Exception {
           Reader reader = new StringReader("<table><tr><td>Hello</td><td>World!</td></tr></table>");
           HTMLEditorKit.Parser parser = new ParserDelegator();
           parser.parse(reader, new HTMLTableParser(), true);
           reader.close();
       }
    }
    
    class HTMLTableParser extends HTMLEditorKit.ParserCallback {
    
        private boolean encounteredATableRow = false;
    
        public void handleText(char[] data, int pos) {
            if(encounteredATableRow) System.out.println(new String(data));
        }
    
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if(t == HTML.Tag.TR) encounteredATableRow = true;
        }
    
        public void handleEndTag(HTML.Tag t, int pos) {
            if(t == HTML.Tag.TR) encounteredATableRow = false;
        }
    }
    
  • 2021-01-04 08:21

    You could use Swing:

    How do you make use of the HTML-processing capabilities that are built into Java? You may not know that Swing contains all the classes necessary to parse HTML. Jeff Heaton shows you how.
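
As a rough illustration of those built-in classes, here is a minimal sketch that uses `javax.swing.text.html.parser.ParserDelegator` to collect the text of every table cell in an HTML string. The class name `SwingCellText` and the method `cellTexts` are hypothetical names made up for this example, and it is not taken from the article mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
class SwingCellText {

    // Returns the text found inside each <td> element of the given HTML.
    static List<String> cellTexts(String html) {
        List<String> cells = new ArrayList<>();

        // The parser reports tags and text to this callback as it reads the input.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean insideCell = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TD) insideCell = true;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TD) insideCell = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (insideCell) cells.add(new String(data));
            }
        };

        try (StringReader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cells;
    }

    public static void main(String[] args) {
        // Prints the collected cell texts as a list.
        System.out.println(cellTexts("<table><tr><td>Hello</td><td>World!</td></tr></table>"));
    }
}
```

This is a callback-style parser, so it does not build a `org.w3c.dom` tree; it suits extracting pieces of text rather than obtaining a DOM Element.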
