In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with their documentation and I just cannot parse it
One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it there); it supports XPath out of the box.
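As a rough illustration of the XPath idea, here is a sketch using only the standard library's ElementTree, which handles a limited XPath subset on well-formed markup (lxml's HTML parser would cope with real, messy pages; the sample snippet and its hrefs below are made up to mimic the attorney listing):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet in the shape of the attorney listing page
sample = """<table><tr>
  <td class="altRow"><a href="/cabel">Abel, Christian</a></td>
  <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
</tr></table>"""

root = ET.fromstring(sample)
# XPath: every <a> that is a direct child of a <td class="altRow">
links = [(a.get('href'), a.text)
         for a in root.findall('.//td[@class="altRow"]/a')]
print(links)  # [('/cabel', 'Abel, Christian'), ('/jacevedo', 'Acevedo, Linda Jeannine')]
```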
Edit:
tested:
soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)
I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html
soup should be a BeautifulSoup object
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
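If Beautiful Soup itself keeps failing on the page, the same extraction can be sketched with only the standard library's HTMLParser (in Python 3; the module is named HTMLParser in Python 2). The class name and sample href are taken from the page discussed here, but this is untested against the live site:

```python
from html.parser import HTMLParser

class AltRowLinks(HTMLParser):
    """Collect (href, text) for <a> tags inside <td class="altRow"> cells."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False   # currently inside a <td class="altRow">
        self.in_link = False   # currently inside an <a> within such a cell
        self.href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td' and attrs.get('class') == 'altRow':
            self.in_cell = True
        elif tag == 'a' and self.in_cell:
            self.in_link = True
            self.href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and self.href:
            self.links.append((self.href, data))
            self.href = None

parser = AltRowLinks()
parser.feed('<td class="altRow"><a href="/cabel">Abel, Christian</a></td>')
print(parser.links)  # [('/cabel', 'Abel, Christian')]
```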
I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.
The thread is located at the Google Groups archive.
I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib
# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()
# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes,
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")
# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))
# compose total matching pattern (add trailing tdStart to filter out
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart
# scan input HTML source for matching refs, and print out the text and
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href
I extracted 914 references from your page, from Abel to Zupikova.
Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
It seems that you are using BeautifulSoup 3.1
I suggest reverting to BeautifulSoup 3.0.7 (because of this problem)
I just tested with 3.0.7 and got the results you expect:
>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]
Testing with BeautifulSoup 3.1 gets the results you are seeing. There is probably a malformed tag in the HTML, but I didn't spot it in a quick look.