Can you provide examples of parsing HTML?

走了就别回头了 2020-11-22 13:49

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions

29 Answers
  • 2020-11-22 14:41

    Language: C#
    Library: HtmlAgilityPack

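    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            // Download and parse the page.
            var web = new HtmlWeb();
            var doc = web.Load("http://www.stackoverflow.com");

            // Select every anchor that has an href attribute.
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
            if (nodes == null) return;

            foreach (var node in nodes)
            {
                Console.WriteLine(node.InnerHtml);
            }
        }
    }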
    
  • 2020-11-22 14:42

    Language: Perl
    Library: HTML::Parser

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use HTML::Parser;
    
    my $find_links = HTML::Parser->new(
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                if ($tag eq 'a' and exists $attr->{href}) {
                    print "$attr->{href}\n";
                }
            }, 
            "tag, attr"
        ]
    );
    
    my $html = join '',
        "<html><body>",
        (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
        "</body></html>";
    
    $find_links->parse($html);
    $find_links->eof;    # signal end of document so any buffered text is flushed
    
  • 2020-11-22 14:42

    Language: Common Lisp
    Library: Closure HTML, Closure XML, CL-WHO

    (shown using the DOM API, without using the XPath or STP APIs)

    (defvar *html*
      (who:with-html-output-to-string (stream)
        (:html
         (:body (loop
                   for site in (list "foo" "bar" "baz")
                   do (who:htm (:a :href (format nil "http://~A.com/" site))))))))
    
    (defvar *dom*
      (chtml:parse *html* (cxml-dom:make-dom-builder)))
    
    (loop
       for tag across (dom:get-elements-by-tag-name *dom* "a")
       collect (dom:get-attribute tag "href"))
    => 
    ("http://foo.com/" "http://bar.com/" "http://baz.com/")
    
  • 2020-11-22 14:42

    Language: Racket

    Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

    (require net/url
             (planet ashinn/html-parser:1)
             (planet clements/sxml2:1))
    
    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->sxml))
    (define links ((sxpath "//a/@href/text()") doc))
    

    The same example using packages from the new package system, html-parsing and sxml:

    (require net/url
             html-parsing
             sxml)
    
    (define the-url (string->url "http://stackoverflow.com/"))
    (define doc (call/input-url the-url get-pure-port html->xexp))
    (define links ((sxpath "//a/@href/text()") doc))
    

    Note: Install the required packages with raco from the command line:

    raco pkg install html-parsing
    

    and:

    raco pkg install sxml
    
  • 2020-11-22 14:42

    Language: ColdFusion 9.0.1+

    Library: jsoup

    <cfscript>
    function parseURL(required string url){
    var res = [];
    var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
    var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
    //var dom = jSoupClass.parse(html); // if you already have some html to parse.
    var dom = jSoupClass.connect( arguments.url ).get();
    var links = dom.select("a");
    for(var a=1;a LTE arrayLen(links);a++){ // LTE so the last link is not skipped (indexing is 1-based)
        var s={};s.href= links[a].attr('href'); s.text= links[a].text(); 
        if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s); 
    }
    return res; 
    }   
    
    //writeoutput(writedump(parseURL(url)));
    </cfscript>
    <cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">
    

    Returns an array of structures; each struct contains HREF and TEXT keys.
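
    For comparison, the same result shape (a list of href/text records for absolute links) can be sketched with Python's standard-library html.parser. This is only an illustration, not part of the ColdFusion answer: the names LinkCollector and parse_links are hypothetical, it parses a string instead of fetching a URL, and it filters on an http:// or https:// prefix rather than the substring check used above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {'href', 'text'} records for absolute http(s) links."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished records
        self._current = None  # record for the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: keep only links that start with http:// or https://.
            if href.startswith(("http://", "https://")):
                self._current = {"href": href, "text": ""}

    def handle_data(self, data):
        # Accumulate the visible text of the link being collected.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def parse_links(html):
    """Return a list of dicts with 'href' and 'text' keys."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    sample = (
        '<a href="http://foo.com">foo</a>'
        '<a href="/local">skipped</a>'
        '<a href="https://bar.com">bar</a>'
    )
    for link in parse_links(sample):
        print(link["href"], link["text"])
```

    Relative links such as /local are dropped, just as the ColdFusion loop only appends entries whose href mentions http:// or https://.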
