Can you provide examples of parsing HTML?


走了就别回头了

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions


29 answers

野的像风

2020-11-22 14:34

Language: Java
Libraries: XOM, TagSoup

I've included intentionally malformed and inconsistent XML in this sample.

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}

TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()));
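The same namespace effect can be illustrated outside XOM. The following is a minimal sketch in Python's standard-library `xml.etree.ElementTree`, not TagSoup itself. The sample document and the `XHTML_NS` name are assumptions made up for the illustration. It shows that an unqualified `a` matches nothing once the elements live in the XHTML namespace, and that binding a prefix to the namespace URI makes the query match again.

```python
import xml.etree.ElementTree as ET

# Assumed for illustration: the standard XHTML namespace URI.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical well-formed sample with a default XHTML namespace, similar in
# spirit to what a namespace-aware tidying parser hands back.
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="http://google.com">google</a>'
    '<a name="nothing">nothing</a>'
    '</body></html>'
)
root = ET.fromstring(doc)

# Unqualified names do not match elements that live in a namespace.
unqualified = root.findall(".//a[@href]")

# Binding a prefix to the namespace URI makes the same query match.
qualified = root.findall(".//xhtml:a[@href]", {"xhtml": XHTML_NS})
hrefs = [a.get("href") for a in qualified]

print(len(unqualified), hrefs)
```

Suppressing the namespace, as the Java sample does, keeps the XPath expressions shorter; keeping it is the safer choice if the tree is later combined with other XML vocabularies.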


不思量自难忘°

2020-11-22 14:34

Language: PHP
Library: SimpleXML (and DOM)

<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xml = simplexml_import_dom($page);

$links = $xml->xpath('//a[@href]');
foreach($links as $link)
    echo $link['href']."\n";


無奈伤痛

2020-11-22 14:36
Language: shell
Library: lynx (not really a library, but in the shell every program is a kind of library)
```
lynx -dump -listonly http://news.google.com/
```

闹比i

2020-11-22 14:38

Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}


有刺的猬

2020-11-22 14:39

Language: Python
Library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print(link)

lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery (in current lxml releases this requires the separate cssselect package to be installed):

for a in tree.cssselect('a[href]'):
    print(a.get('href'))
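Where lxml is not available, roughly the same link extraction can be sketched with Python's standard-library `html.parser`. This is only an illustration, not part of lxml. The names `HrefCollector` and `extract_hrefs` are hypothetical, and the sample markup is the intentionally malformed string from the Java answer in this thread.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags as the parser streams through markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so HREF and href
        # are reported the same way.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(markup):
    """Return the href of every anchor tag in markup, in document order."""
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


# Malformed on purpose: an unclosed <a>, an upper-case HREF, a stray <li>.
sample = (
    '<html><body><ul>'
    '<li><a href="http://google.com">google</li>'
    '<li><a HREF="http://reddit.org" target="_blank">reddit</a></li>'
    '<li><a name="nothing">nothing</a><li>'
    '</ul></body></html>'
)

print(extract_hrefs(sample))
```

Because `html.parser` reports tags as events instead of building a tree, unclosed or stray tags do not stop the extraction. The trade-off is that there is no DOM to query afterwards, so anything beyond simple attribute collection is easier with lxml.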


说谎

2020-11-22 14:40

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);
