Can you provide examples of parsing HTML?

走了就别回头了 2020-11-22 13:49

How do you parse HTML with a variety of languages and parsing libraries?

  • 2020-11-22 14:19

    Language: Python
    Library: HTMLParser (html.parser in Python 3)

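    from html.parser import HTMLParser  # Python 3; the module was named HTMLParser in Python 2
    class FindLinks(HTMLParser):
        """Print the href of every <a> tag that has one."""
        def __init__(self):
            super().__init__()
        def handle_starttag(self, tag, attrs):
            at = dict(attrs)
            if tag == 'a' and 'href' in at:
                print(at['href'])
    find = FindLinks()
    html = "<html><body>"
    for link in ("foo", "bar", "baz"):
        # The original href value was lost; the link name stands in as a placeholder.
        html += '<a href="%s">%s</a>' % (link, link)
    html += "</body></html>"
    find.feed(html)  # run the parser over the generated document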
  • 2020-11-22 14:20

    Language: C#
    Library: System.Xml (standard .NET)

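    using System.Collections.Generic;
    using System.Xml;
    public static class Program
    {
        public static void Main(string[] args)
        {
            List<string> matches = new List<string>();
            XmlDocument xd = new XmlDocument();
            // Sample input (an assumption); XmlDocument accepts only well-formed markup.
            xd.LoadXml("<html><body><a href=\"foo\">foo</a></body></html>");
            FindHrefs(xd.FirstChild, matches);
        }
        // Recursively collect the href attribute of every node that has one.
        static void FindHrefs(XmlNode xn, List<string> matches)
        {
            if (xn.Attributes != null && xn.Attributes["href"] != null)
                matches.Add(xn.Attributes["href"].Value);
            foreach (XmlNode child in xn.ChildNodes)
                FindHrefs(child, matches);
        }
    }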
  • 2020-11-22 14:20

    Language: PHP
    Library: DOM

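    $doc = new DOMDocument();
    $doc->strictErrorChecking = false;
    // The original load call was lost; reading from standard input is an assumption.
    $doc->loadHTMLFile('php://stdin');
    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//a[@href]');
    // Print the href of every <a> element that has one.
    for ($i = 0; $i < $links->length; $i++)
        echo $links->item($i)->getAttribute('href'), "\n";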

    It is sometimes useful to prefix the $doc->loadHTMLFile call with the @ operator to suppress the warnings raised while parsing invalid HTML.
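
For comparison, the same `//a[@href]` query can be sketched in Python with the standard library's xml.etree.ElementTree. This is only a sketch under assumptions: ElementTree needs well-formed markup (XHTML), unlike DOMDocument's HTML loader, and the sample markup and the name find_hrefs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample input (an assumption); it must be well-formed for ElementTree.
SAMPLE = (
    '<html><body>'
    '<a href="foo.html">foo</a>'
    '<a name="anchor-only">no href</a>'
    '<p><a href="bar.html">bar</a></p>'
    '</body></html>'
)


def find_hrefs(markup):
    """Return the href of every <a> element that has one, in document order."""
    root = ET.fromstring(markup)
    # ".//a[@href]" is ElementTree's limited XPath for
    # "any descendant <a> that carries an href attribute".
    return [a.get("href") for a in root.findall(".//a[@href]")]


if __name__ == "__main__":
    for href in find_hrefs(SAMPLE):
        print(href)
```

For real-world pages that are not well-formed, ET.fromstring raises ParseError, so an HTML-aware parser is the safer choice.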

  • 2020-11-22 14:21

    Language: Ruby
    Library: Nokogiri

    #!/usr/bin/env ruby
    require 'nokogiri'
    require 'open-uri'
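    # The URL is omitted in the source. URI.open is needed on Ruby 3.0+,
    # where Kernel#open no longer accepts URLs.
    document = Nokogiri::HTML(URI.open(""))
    document.css("html head title").first.content
    => "Google"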
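
A rough Python counterpart to the `html head title` lookup, using only the standard library's html.parser. The names TitleFinder and find_title are hypothetical, and there is no CSS selector support here, so the sketch simply collects the text of the first <title> element it meets.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # currently inside the first <title>?
        self.done = False       # first <title> already closed?
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def find_title(markup):
    """Return the stripped text of the first <title>, or '' if there is none."""
    parser = TitleFinder()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(find_title("<html><head><title>Google</title></head><body></body></html>"))
```

The text is gathered in a list because html.parser may deliver an element's content in more than one handle_data call.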
  • 2020-11-22 14:22

    Language: Perl
    Library: HTML::LinkExtor

    The beauty of Perl is that it has modules for very specific tasks, such as link extraction.

    Whole program:

    #!/usr/bin/perl -w
    use strict;
    use HTML::LinkExtor;
    use LWP::Simple;
    my $url     = '';
    my $content = get( $url );
    my $p       = HTML::LinkExtor->new( \&process_link, $url, );
    $p->parse( $content );
    exit;
    # Callback invoked by HTML::LinkExtor for each link-carrying tag.
    sub process_link {
        my ( $tag, %attr ) = @_;
        return unless $tag eq 'a';
        return unless defined $attr{ 'href' };
        print "- $attr{'href'}\n";
        return;
    }


    • use strict - turns on "strict" mode, which makes debugging easier; not essential to the example
    • use HTML::LinkExtor - loads the module that does the actual work
    • use LWP::Simple - just a simple way to fetch some HTML for testing
    • my $url = '' - the page we will extract URLs from
    • my $content = get( $url ) - fetches the page's HTML
    • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function used as a callback for every URL, and $url to use as the base URL for relative URLs
    • $p->parse( $content ) - parses the fetched HTML, calling the callback for each link found
    • exit - end of the program
    • sub process_link - start of the function process_link
    • my ($tag, %attr) - gets the arguments, which are the tag name and its attributes
    • return unless $tag eq 'a' - skips processing if the tag is not <a>
    • return unless defined $attr{'href'} - skips processing if the <a> tag has no href attribute
    • print "- $attr{'href'}\n"; - prints the link
    • return; - finishes the function

    That's all.
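
The callback-plus-base-URL idea described above can be sketched in Python with the standard library. LinkExtractor, extract_links and the example.com URLs are hypothetical names chosen for illustration; urljoin plays the role of the base URL argument, turning relative hrefs into absolute ones. Unlike HTML::LinkExtor, this sketch calls the callback for every start tag, not only for tags that carry links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Call callback(tag, attrs) for every start tag, with href made absolute."""

    def __init__(self, callback, base_url):
        super().__init__()
        self.callback = callback
        self.base_url = base_url

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if attr.get("href") is not None:
            # Resolve relative links against the base URL.
            attr["href"] = urljoin(self.base_url, attr["href"])
        self.callback(tag, attr)


def extract_links(markup, base_url):
    """Return the absolute href of every <a> tag in markup."""
    links = []

    def process_link(tag, attr):
        if tag != "a":
            return  # skip tags other than <a>
        if attr.get("href") is None:
            return  # skip <a> tags without an href
        links.append(attr["href"])

    parser = LinkExtractor(process_link, base_url)
    parser.feed(markup)
    parser.close()
    return links


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="faq.html">FAQ</a>'
    for link in extract_links(page, "http://example.com/dir/index.html"):
        print("-", link)
```

Passing the callback in, as the Perl version does, keeps the parsing loop separate from the decision about which links to keep.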

  • 2020-11-22 14:22

    Language: Perl
    Library: XML::Twig

    use strict;
    use warnings;
    use Encode ':all';
    use LWP::Simple;
    use XML::Twig;
    #my $url = '';
    my $url = '';
    my $content = get($url);
    die "Couldn't fetch!" unless defined $content;
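    my $twig = XML::Twig->new();
    $twig->parse_html($content);    # needs HTML::TreeBuilder to convert the HTML to XML
    # Collect the href attribute of every element that has one.
    my @hrefs = map {
        $_->att('href');
    } $twig->get_xpath('//*[@href]');
    print "$_\n" for @hrefs;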

    Caveat: this can produce wide-character errors on pages like this one (switching to the commented-out URL triggers the error), whereas the HTML::Parser-based solution above does not have this problem.

