Can you provide examples of parsing HTML?

走了就别回头了 2020-11-22 13:49

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions

29 answers
  • 2020-11-22 14:19

    language: Python
    library: HTMLParser

    #!/usr/bin/env python3
    
    # In Python 2 this module was named HTMLParser; in Python 3 it is html.parser.
    from html.parser import HTMLParser
    
    class FindLinks(HTMLParser):
        """Print the href of every <a> tag encountered."""
    
        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs
            at = dict(attrs)
            if tag == 'a' and 'href' in at:
                print(at['href'])
    
    
    find = FindLinks()
    
    # Build a small test document containing three links
    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        html += '<a href="http://%s.com">%s</a>' % (link, link)
    html += "</body></html>"
    
    find.feed(html)
    
  • 2020-11-22 14:20

    Language: C#
    Library: System.Xml (standard .NET)

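    using System.Collections.Generic;
    using System.Xml;
    
    public static class Program
    {
        public static void Main(string[] args)
        {
            List<string> matches = new List<string>();
    
            // XmlDocument only accepts well-formed XML, so this works for XHTML
            // but throws an XmlException on typical "tag soup" HTML.
            XmlDocument xd = new XmlDocument();
            xd.LoadXml("<html>...</html>");
    
            // Start at the root element; FirstChild may be an XML declaration or DOCTYPE.
            FindHrefs(xd.DocumentElement, matches);
        }
    
        // Recursively collect the value of every href attribute in the tree.
        static void FindHrefs(XmlNode xn, List<string> matches)
        {
            if (xn.Attributes != null && xn.Attributes["href"] != null)
                matches.Add(xn.Attributes["href"].Value);
    
            foreach (XmlNode child in xn.ChildNodes)
                FindHrefs(child, matches);
        }
    }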
    
  • 2020-11-22 14:20

    Language: PHP
    Library: DOM

    <?php
    $doc = new DOMDocument();
    $doc->strictErrorChecking = false;
    $doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
    $xpath = new DOMXPath($doc);
    
    $links = $xpath->query('//a[@href]');
    for ($i = 0; $i < $links->length; $i++)
        echo $links->item($i)->getAttribute('href'), "\n";
    

    Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress the warnings emitted when parsing invalid HTML.

  • 2020-11-22 14:21

    Language: Ruby
    Library: Nokogiri

    #!/usr/bin/env ruby
    require 'nokogiri'
    require 'open-uri'
    
    # With open-uri, use URI.open; Kernel#open no longer accepts URLs as of Ruby 3.0.
    document = Nokogiri::HTML(URI.open("http://google.com"))
    document.css("html head title").first.content
    # => "Google"
    document.xpath("//title").first.content
    # => "Google"
    
  • 2020-11-22 14:22

    Language: Perl
    Library: HTML::LinkExtor

    The beauty of Perl is that you have modules for very specific tasks, like link extraction.

    Whole program:

    #!/usr/bin/perl -w
    use strict;
    
    use HTML::LinkExtor;
    use LWP::Simple;
    
    my $url     = 'http://www.google.com/';
    my $content = get( $url );
    
    my $p       = HTML::LinkExtor->new( \&process_link, $url, );
    $p->parse( $content );
    
    exit;
    
    sub process_link {
        my ( $tag, %attr ) = @_;
    
        return unless $tag eq 'a';
        return unless defined $attr{ 'href' };
    
        print "- $attr{'href'}\n";
        return;
    }
    

    Explanation:

    • use strict - turns on "strict" mode, which eases debugging; not essential to the example
    • use HTML::LinkExtor - loads the module that does the work
    • use LWP::Simple - just a simple way to fetch some HTML for testing
    • my $url = 'http://www.google.com/' - the page we will be extracting URLs from
    • my $content = get( $url ) - fetches the page's HTML
    • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback for every link found, and $url to use as the base URL for relative URLs
    • $p->parse( $content ) - parses the HTML, invoking the callback for each link
    • exit - end of program
    • sub process_link - beginning of the function process_link
    • my ( $tag, %attr ) = @_ - gets the arguments, which are the tag name and its attributes
    • return unless $tag eq 'a' - skips processing if the tag is not <a>
    • return unless defined $attr{'href'} - skips processing if the <a> tag doesn't have an href attribute
    • print "- $attr{'href'}\n"; - prints the link URL
    • return; - ends the function
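
    For comparison, the same steps (a callback fired for each link, filtering to <a> tags that have an href, and resolving relative URLs against a base URL) can be sketched in Python using only the standard library. This is an illustrative sketch, not a port of HTML::LinkExtor: the names LinkExtractor and extract_links and the example.com URL are made up for the example, and the HTML is an inline string rather than a fetched page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Call callback(url) for every <a href=...>, resolved against base_url.

    LinkExtractor is a hypothetical helper written for this example.
    """

    def __init__(self, callback, base_url):
        super().__init__()
        self.callback = callback
        self.base_url = base_url

    def handle_starttag(self, tag, attrs):
        # Skip every tag that is not <a>
        if tag != "a":
            return
        # Skip <a> tags without an href attribute (or with an empty one: <a href>)
        href = dict(attrs).get("href")
        if href is None:
            return
        # Resolve relative URLs against the base URL before reporting them
        self.callback(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href> links found in html."""
    links = []
    parser = LinkExtractor(links.append, base_url)
    parser.feed(html)
    parser.close()
    return links


if __name__ == "__main__":
    sample = '<a href="/about">About</a> <a name="x">no href</a> <img src="a.png">'
    for link in extract_links(sample, "http://www.example.com/"):
        print("-", link)
```

    Passing the callback into the parser mirrors the Perl design: the parser stays generic, and the caller decides what to do with each link (here, collecting it into a list).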

    That's all.

  • 2020-11-22 14:22

    language: Perl
    library: XML::Twig

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode ':all';
    
    use LWP::Simple;
    use XML::Twig;
    
    #my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
    my $url = 'http://www.google.com';
    my $content = get($url);
    die "Couldn't fetch!" unless defined $content;
    
    my $twig = XML::Twig->new();
    $twig->parse_html($content);
    
    my @hrefs = map {
        $_->att('href');
    } $twig->get_xpath('//*[@href]');
    
    print "$_\n" for @hrefs;
    

    Caveat: this can produce wide-character errors with pages like this one (changing the URL to the commented-out one triggers the error), but the HTML::Parser solution above doesn't share this problem.
