How to extract info from HTML with Java's own Parser?

Question

I don't want to download any other libraries. I'm talking about this one: javax.swing.text.html.HTMLEditorKit.Parser

How can I extract repeated information within a page using this parser?

Say for example I have this code repeated in a page:

    <tr>
      <td class="info1">get this info</td>
      <td class="info2">get this info</td>
      <td class="info3">get this info</td>
    </tr>

Could I have some example code, please?

Thanks in advance.

Answer 1:

It's a stream parser, so as it parses it tells you what it hits. You should extend HTMLEditorKit.ParserCallback with some class (I'll call it Parser), then override the methods you care about.

I believe it only works for "the html dtd in swing" (see here). If you're doing anything more complicated, I recommend you use an external Java HTML parsing library instead, such as one of the ones I linked to before.

Here's the basic code (demo):

import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;
import javax.swing.text.*;
import java.io.*;

class Parser extends HTMLEditorKit.ParserCallback
{
        private boolean inTD = false;

        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
        {
                if(t.equals(HTML.Tag.TD))
                {
                        inTD = true;
                }
        }

        public void handleEndTag(HTML.Tag t, int pos)
        {
                if(t.equals(HTML.Tag.TD))
                {
                        inTD = false;
                }
        }

        public void handleText(char[] data, int pos)
        {
                if(inTD)
                {
                        doSomethingWith(data);
                }
        }

        public void doSomethingWith(char[] data)
        {
                System.out.println(data);
        }

}

class HtmlTester
{
        public static void main (String[] args) throws java.lang.Exception
        {               
            ParserDelegator pd = new ParserDelegator();
            pd.parse(new BufferedReader(new InputStreamReader(System.in)), new Parser(), false);
        }
}
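
To get at the class="info1" / "info2" / "info3" cells from the question, you can read the class attribute off the MutableAttributeSet that handleStartTag receives. The following is only a rough sketch along those lines, not a tested solution for real-world pages: CellCollector and extract are names I made up for illustration, and it assumes the cells sit inside a normal <table>/<tr> structure and that each row should become one map keyed by the cell's class.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: builds one map per <tr>, keyed by each <td>'s class attribute.
class CellCollector extends HTMLEditorKit.ParserCallback {
    private final List<Map<String, String>> rows = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> currentRow; // null while outside a <tr>
    private String currentClass;            // null while outside a <td> that has a class

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t.equals(HTML.Tag.TR)) {
            currentRow = new LinkedHashMap<>();
        } else if (t.equals(HTML.Tag.TD) && currentRow != null) {
            // Copy the value out immediately instead of holding on to the attribute set.
            Object cls = a.getAttribute(HTML.Attribute.CLASS);
            currentClass = (cls == null) ? null : cls.toString();
            text.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (currentClass != null) {
            text.append(data); // a cell's text may arrive in several pieces
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t.equals(HTML.Tag.TD) && currentClass != null) {
            currentRow.put(currentClass, text.toString().trim());
            currentClass = null;
        } else if (t.equals(HTML.Tag.TR) && currentRow != null) {
            if (!currentRow.isEmpty()) {
                rows.add(currentRow);
            }
            currentRow = null;
            currentClass = null;
        }
    }

    // Parses the given HTML and returns the collected rows in document order.
    static List<Map<String, String>> extract(Reader html) throws IOException {
        CellCollector collector = new CellCollector();
        // Third argument is ignoreCharSet; we already have a Reader, so ignore it.
        new ParserDelegator().parse(html, collector, true);
        return collector.rows;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><table>"
                + "<tr><td class=\"info1\">first A</td><td class=\"info2\">first B</td></tr>"
                + "<tr><td class=\"info1\">second A</td><td class=\"info2\">second B</td></tr>"
                + "</table></body></html>";
        for (Map<String, String> row : extract(new StringReader(page))) {
            System.out.println(row);
        }
    }
}
```

Cells without a class attribute are skipped here, and if two cells in one row share a class, the later one overwrites the earlier; adjust that if your page needs something different.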

Answer 2:

Matthew Flaschen answers your direct question. I just want to add a couple of bits of advice:

If you have control (to some degree) over the source of the HTML you are parsing, you should consider changing that source to emit the information in a better form. For example, if it is a web server, get it to respect Accept headers and provide the information in (say) XML or JSON formats when requested.
If you have no control over the source of the HTML, you are at the mercy of whoever does control it. If they change the HTML structure, your parsing may break. This applies whether you use a proper HTML parser or (blech) regexes.

Your best bet to insulate yourself against this is to use a permissive HTML parser (such as JSoup) that understands different versions of the HTML spec, and is more or less tolerant of HTML that violates the specs. (The problem with using a strict parser is that a small mistake such as a missing </li> will render the page unparsable ... for your parser ... even though the page displays just fine in most web browsers.)
It is a bad idea to restrict yourself to using only the standard Java class libraries. The standard libraries often simply don't provide the best solution.

Source: https://stackoverflow.com/questions/9745948/how-to-extract-info-from-html-with-javas-own-parser

Tags

java

html-parsing