How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions
language: Python
library: HTMLParser
#!/usr/bin/python
from HTMLParser import HTMLParser
class FindLinks(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
def handle_starttag(self, tag, attrs):
at = dict(attrs)
if tag == 'a' and 'href' in at:
print at['href']
find = FindLinks()
html = "<html><body>"
for link in ("foo", "bar", "baz"):
html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"
find.feed(html)
Language: C#
Library: System.XML (standard .NET)
using System.Collections.Generic;
using System.Xml;
public static void Main(string[] args)
{
List<string> matches = new List<string>();
XmlDocument xd = new XmlDocument();
xd.LoadXml("<html>...</html>");
FindHrefs(xd.FirstChild, matches);
}
static void FindHrefs(XmlNode xn, List<string> matches)
{
if (xn.Attributes != null && xn.Attributes["href"] != null)
matches.Add(xn.Attributes["href"].InnerXml);
foreach (XmlNode child in xn.ChildNodes)
FindHrefs(child, matches);
}
Language: PHP Library: DOM
<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);
$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
echo $links->item($i)->getAttribute('href'), "\n";
Sometimes it's useful to put @
symbol before $doc->loadHTMLFile
to suppress invalid html parsing warnings
Language: Ruby
Library: Nokogiri
#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"
Language Perl
Library: HTML::LinkExtor
Beauty of Perl is that you have modules for very specific tasks. Like link extraction.
Whole program:
#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;
use LWP::Simple;
my $url = 'http://www.google.com/';
my $content = get( $url );
my $p = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );
exit;
sub process_link {
my ( $tag, %attr ) = @_;
return unless $tag eq 'a';
return unless defined $attr{ 'href' };
print "- $attr{'href'}\n";
return;
}
Explanation:
That's all.
language: Perl
library: XML::Twig
#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';
use LWP::Simple;
use XML::Twig;
#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;
my $twig = XML::Twig->new();
$twig->parse_html($content);
my @hrefs = map {
$_->att('href');
} $twig->get_xpath('//*[@href]');
print "$_\n" for @hrefs;
caveat: Can get wide-character errors with pages like this one (changing the url to the one commented out will get this error), but the HTML::Parser solution above doesn't share this problem.