In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with their documentation and I just cannot parse it
One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it there); it supports XPath out of the box.
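As a rough illustration of the XPath idea, here is a sketch using only the standard library's ElementTree, which handles a limited XPath subset on well-formed markup (lxml's HTML parser would cope with real, messy pages; the sample snippet and its hrefs below are made up to mimic the attorney listing):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet in the shape of the attorney listing page
sample = """<table><tr>
  <td class="altRow"><a href="/cabel">Abel, Christian</a></td>
  <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
</tr></table>"""

root = ET.fromstring(sample)
# XPath: every <a> that is a direct child of a <td class="altRow">
links = [(a.get('href'), a.text)
         for a in root.findall('.//td[@class="altRow"]/a')]
print(links)  # [('/cabel', 'Abel, Christian'), ('/jacevedo', 'Acevedo, Linda Jeannine')]
```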
Edit:
tested:
soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)
I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html
soup should be a BeautifulSoup object
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
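If Beautiful Soup itself keeps failing on the page, the same extraction can be sketched with only the standard library's HTMLParser (in Python 3; the module is named HTMLParser in Python 2). The class name and sample href are taken from the page discussed here, but this is untested against the live site:

```python
from html.parser import HTMLParser

class AltRowLinks(HTMLParser):
    """Collect (href, text) for <a> tags inside <td class="altRow"> cells."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False   # currently inside a <td class="altRow">
        self.in_link = False   # currently inside an <a> within such a cell
        self.href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td' and attrs.get('class') == 'altRow':
            self.in_cell = True
        elif tag == 'a' and self.in_cell:
            self.in_link = True
            self.href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and self.href:
            self.links.append((self.href, data))
            self.href = None

parser = AltRowLinks()
parser.feed('<td class="altRow"><a href="/cabel">Abel, Christian</a></td>')
print(parser.links)  # [('/cabel', 'Abel, Christian')]
```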
I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.
The thread is located at the Google Groups archive.
I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib
# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()
# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes,
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")
# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))
# compose total matching pattern (add trailing tdStart to filter out
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart
# scan input HTML source for matching refs, and print out the text and
# href values
for ref,s,e in patt.scanString(html):
    print ref.text, ref.a.href
I extracted 914 references from your page, from Abel to Zupikova.
Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
It seems that you are using BeautifulSoup 3.1
I suggest reverting to BeautifulSoup 3.0.7 (because of this problem)
I just tested with 3.0.7 and got the results you expect:
>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]
Testing with BeautifulSoup 3.1 gets the results you are seeing. There is probably a malformed tag in the HTML, but I didn't spot it in a quick look.