Can you provide examples of parsing HTML?

走了就别回头了 2020-11-22 13:49

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions

29 Answers
  • 2020-11-22 14:34

    Language: Java
    Libraries: XOM, TagSoup

    I've included intentionally malformed and inconsistent HTML in this sample.

    import java.io.IOException;
    
    import nu.xom.Builder;
    import nu.xom.Document;
    import nu.xom.Element;
    import nu.xom.Node;
    import nu.xom.Nodes;
    import nu.xom.ParsingException;
    import nu.xom.ValidityException;
    
    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.SAXException;
    
    public class HtmlTest {
        public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
            final Parser parser = new Parser();
            parser.setFeature(Parser.namespacesFeature, false);
            final Builder builder = new Builder(parser);
            final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
            final Element root = document.getRootElement();
            final Nodes links = root.query("//a[@href]");
            for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
                final Node node = links.get(linkNumber);
                System.out.println(((Element) node).getAttributeValue("href"));
            }
        }
    }
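
To illustrate why a lenient parser such as TagSoup is needed for this input, here is a minimal Python sketch using only the standard library (it is not part of the Java answer, and the `LinkCollector`, `strict_parse_fails`, and `collect_links` names are hypothetical). It feeds the same malformed markup to a strict XML parser, which rejects it, and to the tolerant `html.parser`, which still yields the links.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Same intentionally malformed markup as in the Java sample above.
MARKUP = (
    '<html><body><ul>'
    '<li><a href="http://google.com">google</li>'
    '<li><a HREF="http://reddit.org" target="_blank">reddit</a></li>'
    '<li><a name="nothing">nothing</a><li>'
    '</ul></body></html>'
)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so HREF matches too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def strict_parse_fails(markup):
    """Return True if a strict XML parser rejects the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return True
    return False


def collect_links(markup):
    """Return every href found by the tolerant HTML parser."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print(strict_parse_fails(MARKUP))
    for href in collect_links(MARKUP):
        print(href)
```

The unclosed `<a>` in the first list item is what trips the strict parser; the tolerant parser only reports start tags, so it never needs the tree to balance.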
    

    TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:

    root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()));
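
The same namespace requirement shows up in other XPath-style APIs. As a rough illustration in Python (standard library `xml.etree.ElementTree`, not XOM; the sample document and the `xhtml` prefix are assumptions made for this sketch), an unprefixed query matches nothing once the elements live in the XHTML namespace, while a prefixed query bound to the namespace URI does:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumed sample: a well-formed document whose elements are in the XHTML namespace.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="http://google.com">google</a>'
    '<a name="nothing">nothing</a>'
    '</body></html>'
)

root = ET.fromstring(DOC)

# Without a namespace binding, the path matches no elements.
unqualified = root.findall(".//a[@href]")

# Binding a prefix to the namespace URI makes the same path match.
qualified = root.findall(".//xhtml:a[@href]", {"xhtml": XHTML_NS})

print(len(unqualified))
for link in qualified:
    print(link.get("href"))
```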
    
  • 2020-11-22 14:34

    Language: PHP
    Library: SimpleXML (and DOM)

    <?php
    $page = new DOMDocument();
    $page->strictErrorChecking = false;
    $page->loadHTMLFile('http://stackoverflow.com/questions/773340');
    $xml = simplexml_import_dom($page);
    
    $links = $xml->xpath('//a[@href]');
    foreach($links as $link)
        echo $link['href']."\n";
    
  • 2020-11-22 14:36

    Language: shell
    Library: lynx (strictly speaking it's a program, not a library, but in the shell every program is a kind of library)

    lynx -dump -listonly http://news.google.com/
    
  • 2020-11-22 14:38

    Language: Perl
    Library: HTML::TreeBuilder

    use strict;
    use HTML::TreeBuilder;
    use LWP::Simple;
    
    my $content = get 'http://www.stackoverflow.com';
    my $document = HTML::TreeBuilder->new->parse($content)->eof;
    
    for my $a ($document->find('a')) {
        print $a->attr('href'), "\n" if $a->attr('href');
    }
    
  • 2020-11-22 14:39

    Language: Python
    Library: lxml.html

    import lxml.html
    
    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"
    
    tree = lxml.html.document_fromstring(html)
    for element, attribute, link, pos in tree.iterlinks():
        if attribute == "href":
            print(link)
    

    lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:

    for a in tree.cssselect('a[href]'):
        print(a.get('href'))
    
  • 2020-11-22 14:40

    Language: Perl
    Library: pQuery

    use strict;
    use warnings;
    use pQuery;
    
    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";
    
    pQuery( $html )->find( 'a' )->each(
        sub {  
            my $at = $_->getAttribute( 'href' ); 
            print "$at\n" if defined $at;
        }
    );
    