How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions
Language: JavaScript/Node.js
Library: Request and Cheerio
var request = require('request');
var cheerio = require('cheerio');
var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        var anchorTags = $('a');
        anchorTags.each(function (i, element) {
            console.log(element["attribs"]["href"]);
        });
    }
});
The Request library downloads the HTML document, and Cheerio lets you query it with jQuery-style CSS selectors.
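One thing to watch for: the attribute values printed above are the raw href strings from the markup, so they may be relative. Here is a minimal sketch of resolving them against the page URL with Node's built-in URL class. The resolveLinks helper and the sample values are hypothetical, made up for illustration; they are not part of Request or Cheerio.

```javascript
// Hypothetical helper (not part of Request or Cheerio): resolve raw href
// values against the page URL using Node's built-in WHATWG URL class.
function resolveLinks(hrefs, baseUrl) {
    var resolved = [];
    hrefs.forEach(function (href) {
        // Anchors without an href attribute yield undefined; skip them.
        if (typeof href !== 'string') return;
        try {
            resolved.push(new URL(href, baseUrl).href);
        } catch (e) {
            // Skip values that cannot be parsed as a URL.
        }
    });
    return resolved;
}

// Sample values chosen for illustration only.
var sample = ['item?id=1', '/newest', 'https://example.com/', undefined];
console.log(resolveLinks(sample, 'https://news.ycombinator.com/').join('\n'));
```

In the Cheerio loop you would collect element.attribs.href into an array and pass it in together with url.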
Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?
Language: JavaScript
Library: DOM
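var links = document.links;
// Iterate by index: for...in on an HTMLCollection also visits non-link properties such as length and item.
for (var i = 0; i < links.length; i++) {
    var href = links[i].href;
    if (href != null) console.debug(href);
}
(Using Firebug's console.debug for output.)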
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)
Selector expression:
(def test-select
  (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
Now we can do the following at the REPL (I've added line breaks in the test-select output):
user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
{:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
{:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")
You'll need the following to try it out:
Preamble:
(require '[net.cgrand.enlive-html :as html])
Test HTML:
(def test-html
  (apply str (concat ["<html><body>"]
                     (for [link ["foo" "bar" "baz"]]
                       (str "<a href=\"http://" link ".com/\">" link "</a>"))
                     ["</body></html>"])))
Language: JavaScript
Library: PhantomJS
Save this file as extract-links.js:
// The global WebPage constructor is the old API; the webpage module is the documented way to create a page.
var page = require('webpage').create(),
    url = 'http://www.udacity.com';
page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // The callback runs inside the page; only its return value comes back.
        var results = page.evaluate(function () {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});
run:
$ ../path/to/bin/phantomjs extract-links.js
Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest
ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else {
    // Release before throwing so the request is not leaked on the failure path.
    [request release];
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];
}
...
- (id)query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    // The selector is query:withResponse: (lowercase "w"), matching the call above.
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    return nodes;
}