How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions
language: Python
library: lxml.html
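import lxml.html

# Build a small test document containing three links
html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="%s">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
# iterlinks() yields (element, attribute, link, pos) for every link in the document
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print(link)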
lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery (newer lxml releases need the separate cssselect package installed for this):
for a in tree.cssselect('a[href]'):
    print(a.get('href'))
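For comparison, the same hrefs can also be collected with lxml's built-in xpath() method, which does not depend on CSS selector support. This is only a minimal sketch: the sample markup below is made up for illustration and is not part of the original answer.

```python
import lxml.html

# Hypothetical sample markup, invented for this sketch:
# two anchors with an href and one named anchor without.
html = (
    '<html><body>'
    '<a href="foo">foo</a>'
    '<a name="top">no href here</a>'
    '<a href="bar">bar</a>'
    '</body></html>'
)

tree = lxml.html.document_fromstring(html)

# Select the href attribute of every <a> element that has one
hrefs = tree.xpath('//a[@href]/@href')
for href in hrefs:
    print(href)
```

The XPath expression plays the same role as the 'a[href]' selector: anchors without an href attribute are skipped.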