Can you provide examples of parsing HTML?

走了就别回头了 2020-11-22 13:49

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual comments will be linked to in answers to questions

29 answers
  • 2020-11-22 14:22

    Language: JavaScript/Node.js

    Library: Request and Cheerio

    var request = require('request');
    var cheerio = require('cheerio');
    
    var url = "https://news.ycombinator.com/";
    request(url, function (error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html);
            var anchorTags = $('a');
    
            anchorTags.each(function(i,element){
                console.log(element["attribs"]["href"]);
            });
        }
    });
    

    The Request library downloads the HTML document, and Cheerio lets you query it with jQuery-style CSS selectors.
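
    Cheerio returns each href exactly as it is written in the markup, so relative links stay relative. If absolute URLs are needed, one option is to resolve them against the page URL with Node's built-in URL class. The following is a minimal sketch under that assumption; `toAbsolute` and the sample hrefs are hypothetical names and values, not part of Request or Cheerio.

    ```javascript
    // Resolve raw href attribute values against the URL of the page they came from.
    // `toAbsolute` is a hypothetical helper; the sample hrefs are made up.
    function toAbsolute(hrefs, baseUrl) {
        var resolved = [];
        hrefs.forEach(function (href) {
            // Anchors without an href attribute yield undefined; skip them.
            if (typeof href !== 'string') return;
            try {
                // URL is a global in Node.js and resolves relative references.
                resolved.push(new URL(href, baseUrl).href);
            } catch (e) {
                // Skip values that cannot be parsed as a URL.
            }
        });
        return resolved;
    }

    var sample = ['item?id=1', '/newest', 'https://example.com/a', undefined];
    console.log(toAbsolute(sample, 'https://news.ycombinator.com/').join('\n'));
    ```

    In the Cheerio loop above, the values of element.attribs.href could be collected into an array and passed to a helper like this one together with the url variable.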

  • 2020-11-22 14:23

    Language: Perl
    Library: HTML::Parser
    Purpose: How can I remove unused, nested HTML span tags with a Perl regex?

  • 2020-11-22 14:23

    Language: JavaScript
    Library: DOM

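    var links = document.links;
    // document.links is an HTMLCollection: iterate by index, because for...in
    // also visits non-element properties such as length and item.
    for (var i = 0; i < links.length; i++) {
        var href = links[i].href;
        if (href != null) console.debug(href);
    }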
    

    (This uses Firebug's console.debug for output.)

  • 2020-11-22 14:27

    Language: Clojure
    Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)


    Selector expression:

    (def test-select
         (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
    

    Now we can do the following at the REPL (I've added line breaks in test-select):

    user> test-select
    ({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
     {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
     {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
    user> (map #(get-in % [:attrs :href]) test-select)
    ("http://foo.com/" "http://bar.com/" "http://baz.com/")
    

    You'll need the following to try it out:

    Preamble:

    (require '[net.cgrand.enlive-html :as html])
    

    Test HTML:

    (def test-html
         (apply str (concat ["<html><body>"]
                            (for [link ["foo" "bar" "baz"]]
                              (str "<a href=\"http://" link ".com/\">" link "</a>"))
                            ["</body></html>"])))
    
  • 2020-11-22 14:27

    Language: JavaScript
    Library: PhantomJS

    Save this file as extract-links.js:

    var page = require('webpage').create(), // documented way to create a page object
        url = 'http://www.udacity.com';
    
    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var results = page.evaluate(function() {
                var list = document.querySelectorAll('a'), links = [], i;
                for (i = 0; i < list.length; i++) {
                    links.push(list[i].href);
                }
                return links;
            });
            console.log(results.join('\n'));
        }
        phantom.exit();
    });
    

    Run:

    $ ../path/to/bin/phantomjs extract-links.js
    
  • 2020-11-22 14:28

    Language: Objective-C
    Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

    ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
    [request start];
    NSError *error = [request error];
    if (!error) {
        NSData *response = [request responseData];
        NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
        [request release];
    }
    else 
        @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];
    
    ...
    
    - (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
        NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
        if (nodes != nil)
            return nodes;
        return nil;
    }
    