Writing an HTML Parser

Question

I am currently attempting (or planning to attempt) to write as simple a program as possible to parse an HTML document into a tree.

After googling I have found many answers saying "don't do it, it's been done" (or words to that effect), references to examples of HTML parsers, and a rather emphatic article on why one shouldn't use regular expressions. However, I haven't found any guides on the "right" way to write a parser. (This, by the way, is something I'm attempting more as a learning exercise than anything, so I'd quite like to do it myself rather than use a premade one.)

I believe I could make a working XML parser just by reading the document and adding the tags, text, etc. to the tree, stepping up a level whenever I hit a close tag (again, simple, no fancy threading or efficiency required at this stage). However, in HTML not all tags are closed.

So my question is this: what would you recommend as a way of dealing with this? The only idea I've had is to treat it in a similar way to the XML, but keep a list of tags that aren't necessarily closed, each with its conditions for closure (e.g. <p> ends on </p> or on the next <p> tag).

Has anyone any other (hopefully better) suggestions? Is there a better way of doing this altogether?

Answer 1:

So, I'll try for an answer here:

Basically, what makes "plain" HTML parsing (not talking about valid XHTML here) different from XML parsing is loads of rules like never-closed <img> tags, or, strictly speaking, the fact that even the sloppiest HTML markup will still more or less render in a browser. You will need a validator along with the parser to build your tree. But you'll have to decide which HTML standard you want to support, so that when you come across a weakness in the markup, you'll know it's an error and not just sloppy HTML.

Know all the rules, build a validator, and then you'll be able to build a parser. That's Plan A.

Plan B would be to allow for a certain error tolerance in your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list, determining whether a tag is left open or was never opened at all. Eventually you get a "good" layout tree, which will be an approximate solution for sloppy markup while being exact for correct markup.

Hope that helped!
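A rough sketch of the Plan B idea, in Python. The names TAG_RE, list_tags and find_problems are made up for this illustration, and the regular expression is only used to pull out tag names: it will trip over comments, scripts, and a ">" inside a quoted attribute value. Void elements such as <img> will also be reported as "left open" unless you special-case them.

```python
import re

# Hypothetical pattern: matches an opening or closing tag and captures only
# its name, so attributes are dropped as the answer suggests.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def list_tags(html):
    """Return [(name, is_close), ...] in document order."""
    return [(m.group(2).lower(), m.group(1) == "/") for m in TAG_RE.finditer(html)]


def find_problems(tags):
    """Return (left_open, never_opened) lists of tag names."""
    open_stack = []
    left_open = []
    never_opened = []
    for name, is_close in tags:
        if not is_close:
            open_stack.append(name)
        elif name in open_stack:
            # Everything opened after the matching tag was left open.
            while open_stack[-1] != name:
                left_open.append(open_stack.pop())
            open_stack.pop()
        else:
            never_opened.append(name)
    # Whatever is still open at the end of the document was never closed.
    left_open.extend(reversed(open_stack))
    return left_open, never_opened
```

Calling find_problems(list_tags(html)) gives the two lists; deciding what to do with them (insert the missing close tags, drop the stray ones) is where the "approximate solution" part comes in.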

Answer 2:

The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does.

You'll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>, <body>} means you're currently in the body of the html document. When you encounter a new node, you compare the requirements for that node to what's currently on the stack.

Suppose your stack is currently just {<html>}. You encounter a <p> tag. You look up <p> in a table that tells you a paragraph must be inside the <body>. Since you're not in the body, you implicitly push <body> onto your stack (or add a body node to your tree). Then you can put the <p> into the tree.

Now suppose you see another <p>. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p> off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.

At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.

The trick is to find a good way to represent the context requirements for each element.
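One possible way to represent those context requirements is a pair of lookup tables, sketched below in Python. Everything here (Node, MUST_BE_INSIDE, CANNOT_NEST_IN, build_tree, and the ("open", name) / ("close", name) event format) is an assumption made for the example, and the tables only cover the <p> case walked through above. Text nodes and chains of missing ancestors are left out.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

    def __repr__(self):
        inner = "".join(repr(child) for child in self.children)
        return f"<{self.name}>{inner}</{self.name}>"


# Hypothetical rule tables; a real parser would need many more entries.
MUST_BE_INSIDE = {"p": "body", "body": "html"}
CANNOT_NEST_IN = {"p": {"p"}}


def build_tree(events):
    """Build a tree from ("open", name) / ("close", name) events."""
    root = Node("html")
    stack = [root]

    def push(name):
        node = Node(name)
        stack[-1].children.append(node)
        stack.append(node)

    for kind, name in events:
        if kind == "open":
            if name == "html":
                continue  # the root already exists
            # Implicitly close elements that may not contain the new one.
            while len(stack) > 1 and stack[-1].name in CANNOT_NEST_IN.get(name, ()):
                stack.pop()
            # Implicitly open a required ancestor that is missing.
            required = MUST_BE_INSIDE.get(name)
            if required and all(node.name != required for node in stack):
                push(required)
            push(name)
        elif name in [node.name for node in stack[1:]]:
            # Pop up to and including the nearest matching open element.
            while stack.pop().name != name:
                pass
        # A close tag with no matching open element is ignored.

    # End of document: pop everything as though each had a close tag.
    while len(stack) > 1:
        stack.pop()
    return root
```

With the events open p, open p, close p this yields <html><body><p></p><p></p></body></html>: the <body> is added implicitly and the first paragraph is closed when the second one starts.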

Answer 3:

Now that the HTML5 standard exists, writing an HTML parser is no longer a matter of trial and error or arcane knowledge.

Instead you just have to implement the standardized parsing algorithm.

Answer 4:

Harsh answer

HTML is not XML. XHTML is XML. Most websites are HTML; some are XHTML. In XHTML all tags must be closed (or have no body, which is still closed).

If you want to write an HTML parser as a learning experiment, then go for it. If you want to write the next "Greaterest HTML parserer" then give it up. Apache (or somebody else) wins; the important information is: you don't know more than the large groups that specialize in parsing HTML.

To answer the question "How do I deal with this?": read the W3C spec on HTML. It answers your question. If your response is "but I don't want to", then you are actually saying "I'm a lazy goofrocket who wants to pretend to learn". If that is the case, I suggest you delete the post and move on; the Microsoft IE team probably has some documents that will interest you.

Less harsh answer

HTML is not easy to parse. At its loosest, you don't need head or body elements, and a lot of tags do not need to be closed. A basic rule when parsing HTML is that if you encounter a new block element, you automatically close the previous block element. You cannot use a standard XML parser for this because HTML is not XML.

Similar to XML, you will need to split your document into elements, including free text elements.

XHTML is much easier because it must be well-formed XML. You can use an XML parser for this.

Answer 5:

Nearly a decade late, but whatever. If not relevant to you, it is to future visitors.

Another option would be to implement the specs.

The WHATWG has a normative specification for HTML. All the quirks are accounted for there, so you can be sure you haven't forgotten some weird mechanic of HTML (there are a lot).

The specification also contains the section § 13.2 Parsing HTML documents, which outlines how a user agent (your parser) should parse an HTML document into a DOM tree. All edge cases are already thought of. The most difficult part is choosing the right data structures and program flow in your language of choice to implement it.
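To give a feel for that program flow: the specification describes tokenization as a state machine with named states such as "Data state", "Tag open state", "End tag open state" and "Tag name state". Below is a heavily reduced Python sketch in that style. It is not a conforming implementation: SKIP_TO_GT is a made-up state that stands in for all the attribute states, comments, doctypes, character references and script handling are missing, and the handling of a non-letter after "</" is a simplification rather than what the specification prescribes.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    END_TAG_OPEN = auto()
    TAG_NAME = auto()
    SKIP_TO_GT = auto()  # hypothetical stand-in for the attribute states


def tokenize(text):
    """Return ("chars", s), ("start", name) and ("end", name) tokens."""
    tokens = []
    state = State.DATA
    chars = ""       # pending character data
    name = ""        # tag name being collected
    is_end = False   # True while collecting an end tag

    def flush_chars():
        nonlocal chars
        if chars:
            tokens.append(("chars", chars))
            chars = ""

    def emit_tag():
        tokens.append(("end" if is_end else "start", name))

    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                chars += ch
        elif state is State.TAG_OPEN:
            if ch == "/":
                state = State.END_TAG_OPEN
            elif ch.isascii() and ch.isalpha():
                flush_chars()
                name, is_end = ch.lower(), False
                state = State.TAG_NAME
            else:
                # Not a tag after all: keep "<" as text and reconsume ch.
                chars += "<"
                state = State.DATA
                i -= 1
        elif state is State.END_TAG_OPEN:
            if ch.isascii() and ch.isalpha():
                flush_chars()
                name, is_end = ch.lower(), True
                state = State.TAG_NAME
            else:
                # Simplification: keep "</" as text and reconsume ch.
                chars += "</"
                state = State.DATA
                i -= 1
        elif state is State.TAG_NAME:
            if ch == ">":
                emit_tag()
                state = State.DATA
            elif ch.isspace() or ch == "/":
                state = State.SKIP_TO_GT
            else:
                name += ch.lower()
        else:  # State.SKIP_TO_GT: ignore attributes up to the closing ">"
            if ch == ">":
                emit_tag()
                state = State.DATA
    # End of input: unfinished tags are dropped, pending text is kept.
    if state is State.TAG_OPEN:
        chars += "<"
    elif state is State.END_TAG_OPEN:
        chars += "</"
    flush_chars()
    return tokens
```

The explicit state variable plus a "reconsume" step (i -= 1) is the part that carries over to the real algorithm; the tree construction stage in the specification then consumes these tokens one at a time.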

Good luck and keep your spirits up, reader!

Answer 6:

Have you tried using this library: http://simplehtmldom.sourceforge.net/ ?

Source: https://stackoverflow.com/questions/7192101/writing-an-html-parser

Tags

html

parsing

html-parsing