Can you provide examples of parsing HTML?

走了就别回头了

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions

29 answers

遥遥无期

2020-11-22 14:19

language: Python
library: HTMLParser (html.parser in Python 3)

#!/usr/bin/env python3

from html.parser import HTMLParser

class FindLinks(HTMLParser):
    """Print the href of every <a> tag fed to the parser."""

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print(at['href'])


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)

南笙

2020-11-22 14:20

Language: C#
Library: System.Xml (standard .NET)

using System.Collections.Generic;
using System.Xml;

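public static class Program
{
    public static void Main(string[] args)
    {
        List<string> matches = new List<string>();

        // XmlDocument is an XML parser, so the markup must be well-formed (XHTML).
        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<html>...</html>");

        // Start at the root element and collect every href below it.
        FindHrefs(xd.DocumentElement, matches);
    }

    static void FindHrefs(XmlNode xn, List<string> matches)
    {
        // Text and comment nodes have no attribute collection.
        if (xn.Attributes != null && xn.Attributes["href"] != null)
            matches.Add(xn.Attributes["href"].Value);

        foreach (XmlNode child in xn.ChildNodes)
            FindHrefs(child, matches);
    }
}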
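
Since an XML parser expects well-formed markup, this approach is best suited to XHTML-style input. For comparison, here is a rough sketch of the same recursive walk in Python using xml.etree.ElementTree. The function names `find_hrefs` and `hrefs_from_xhtml` and the sample strings are made up for illustration, and returning None on a parse failure is just one possible way to handle it.

```python
import xml.etree.ElementTree as ET


def find_hrefs(node, matches):
    """Recursively collect every href attribute at or below `node`."""
    href = node.get("href")
    if href is not None:
        matches.append(href)
    for child in node:
        find_hrefs(child, matches)
    return matches


def hrefs_from_xhtml(markup):
    """Parse well-formed markup; return None if the XML parser rejects it."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Typical tag soup (e.g. an unclosed <br>) ends up here.
        return None
    return find_hrefs(root, [])


if __name__ == "__main__":
    # Hypothetical sample inputs, not taken from the answer above.
    well_formed = '<html><body><a href="http://foo.com">foo</a><br/></body></html>'
    tag_soup = '<html><body><a href="http://foo.com">foo</a><br></body></html>'
    print(hrefs_from_xhtml(well_formed))
    print(hrefs_from_xhtml(tag_soup))
```

The second input differs only by an unclosed `<br>`, which is enough for a strict XML parser to give up, so real-world pages usually need an HTML-aware parser instead.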

闹比i

2020-11-22 14:20

Language: PHP
Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);

$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress the warnings raised when parsing invalid HTML.

温柔的废话

2020-11-22 14:21

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
document = Nokogiri::HTML(URI.open("http://google.com"))
document.css("html head title").first.content
# => "Google"
document.xpath("//title").first.content
# => "Google"

梦如初夏

2020-11-22 14:22
Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that it has modules for very specific tasks, such as link extraction.

Whole program:
```
#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}
```
Explanation:
- use strict - turns on "strict" mode, which eases debugging; not essential to the example
- use HTML::LinkExtor - loads the module that does the actual work
- use LWP::Simple - just a simple way to fetch some HTML for testing
- my $url = 'http://www.google.com/' - the page we will extract URLs from
- my $content = get( $url ) - fetches the page's HTML
- my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function to use as a callback for every link found, and $url to use as the BASEURL for relative URLs
- $p->parse( $content ) - parses the fetched HTML, invoking the callback along the way
- exit - end of the program
- sub process_link - start of the function process_link
- my ( $tag, %attr ) = @_ - unpacks the arguments: the tag name and its attributes
- return unless $tag eq 'a' - skips the tag if it is not <a>
- return unless defined $attr{'href'} - skips the <a> tag if it has no href attribute
- print "- $attr{'href'}\n"; - prints the link
- return; - ends the function
That's all.
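
For readers who don't use Perl, the steps above can be sketched roughly in Python with the standard library. This is not a port of HTML::LinkExtor: the names `LinkExtractor` and `extract_links` and the example.com URL are invented for illustration, and unlike the Perl module this sketch calls the callback for every start tag, not only for tags that carry link attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Call `callback(tag, attr)` for each start tag, with href made absolute."""

    def __init__(self, callback, base_url):
        super().__init__()
        self.callback = callback
        self.base_url = base_url

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if attr.get("href") is not None:
            # Resolve relative URLs against the base, playing the role of BASEURL.
            attr["href"] = urljoin(self.base_url, attr["href"])
        self.callback(tag, attr)


def extract_links(html, base_url):
    """Return the absolute href of every <a> tag in `html`."""
    found = []

    def process_link(tag, attr):
        if tag != "a":
            return  # skip anything that is not <a>
        if attr.get("href") is None:
            return  # skip <a> tags without an href
        found.append(attr["href"])

    parser = LinkExtractor(process_link, base_url)
    parser.feed(html)
    parser.close()
    return found


if __name__ == "__main__":
    # Hypothetical sample markup and base URL.
    sample = '<a href="/about">About</a> <a name="top"></a> <img src="x.png">'
    for href in extract_links(sample, "http://www.example.com/"):
        print("-", href)
```

Collecting the links in a list rather than printing inside the callback makes the result easier to reuse, but printing directly, as the Perl version does, works just as well for a one-off script.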
梦如初夏

2020-11-22 14:22

language: Perl
library: XML::Twig

#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';

use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;

Caveat: this can produce wide-character errors on pages like this one (switch the URL to the commented-out one to reproduce the error), whereas the HTML::Parser solution above doesn't share this problem.
