How can I translate this XPath expression to BeautifulSoup?

耶瑟儿~ 2021-01-04 20:43

In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with their documentation and I just cannot parse it.

4 answers
  • 2021-01-04 21:04

    One option is to use lxml, which supports XPath out of the box. (I'm not familiar with BeautifulSoup, so I can't say how to do it with that.)

    Edit:
    Try this (tested):

    soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)
    

    I used the docs at http://www.crummy.com/software/BeautifulSoup/documentation.html

    soup should be a BeautifulSoup object:

    import re  # needed for re.compile() in the findAll call above
    import BeautifulSoup
    soup = BeautifulSoup.BeautifulSoup(html_string)
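
For comparison with the XPath route mentioned at the top of this answer, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree, which implements a small subset of XPath. The sample markup, the variable names, the XPath string and the href pattern are all assumptions made for illustration, since the question's original expression is not reproduced here. ElementTree also needs well-formed input, so a real-world HTML page would normally go through lxml or BeautifulSoup instead.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for the real page.
SAMPLE = """
<table>
  <tr>
    <td class="altRow"><a href="/cabel">Abel, Christian</a></td>
    <td class="altRow"><a href="/jacevedo">Acevedo, Linda Jeannine</a></td>
    <td class="other"><a href="/skip">Not wanted</a></td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# XPath subset: every <a> that is a direct child of a <td class="altRow">.
links = root.findall(".//td[@class='altRow']/a")

# ElementTree's XPath has no regex support, so the href filter is done in
# Python, playing the role of href=re.compile(...) in the BeautifulSoup call.
href_pattern = re.compile(r"^/\w+$")  # hypothetical pattern
results = [
    (a.text, a.get("href"))
    for a in links
    if href_pattern.match(a.get("href", ""))
]

for text, href in results:
    print(text, href)
```

The predicate `[@class='altRow']` is the part that corresponds to the second positional argument of `findAll('td', 'altRow')`, and the trailing `/a` corresponds to `recursive=False`, since both restrict the match to direct children.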
    
  • 2021-01-04 21:07

    I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.

    The thread is located at the Google Groups archive.

  • 2021-01-04 21:16

    I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:

    from pyparsing import makeHTMLTags, withAttribute, SkipTo
    import urllib
    
    # get the HTML from your URL
    url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    
    # define opening and closing tag expressions for <td> and <a> tags
    # (makeHTMLTags also comprehends tag variations, including attributes, 
    # upper/lower case, etc.)
    tdStart,tdEnd = makeHTMLTags("td")
    aStart,aEnd = makeHTMLTags("a")
    
    # only interested in tdStarts if they have "class=altRow" attribute
    tdStart.setParseAction(withAttribute(("class","altRow")))
    
    # compose total matching pattern (add trailing tdStart to filter out 
    # extraneous <td> matches)
    patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart
    
    # scan input HTML source for matching refs, and print out the text and 
    # href values
    for ref,s,e in patt.scanString(html):
        print ref.text, ref.a.href
    

    I extracted 914 references from your page, from Abel to Zupikova.

    Abel, Christian /cabel
    Acevedo, Linda Jeannine /jacevedo
    Acuña, Jennifer /jacuna
    Adeyemi, Ike /igbadegesin
    Adler, Avraham /aadler
    ...
    Zhu, Jie /jzhu
    Zídek, Aleš /azidek
    Ziółek, Agnieszka /aziolek
    Zitter, Adam /azitter
    Zupikova, Jana /jzupikova
    
  • 2021-01-04 21:18

    It seems that you are using BeautifulSoup 3.1.

    I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).

    I just tested with 3.0.7 and got the results you expect:

    >>> soup.findAll(href=re.compile(r'/cabel'))
    [<a href="/cabel">Abel, Christian</a>]
    

    Testing with BeautifulSoup 3.1 reproduces the results you are seeing. There is probably a malformed tag in the HTML, but I didn't spot it in a quick look.
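
If a tree-building parser keeps tripping over the markup, one possible fallback is an event-driven scan that never builds a tree. The sketch below uses Python 3's standard html.parser to imitate the `findAll(href=re.compile(...))` call above. The `HrefCollector` class, its attributes, and the sample markup with its deliberately unclosed `<td>` are hypothetical, and this does not fix BeautifulSoup 3.1 itself.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect (href, text) for every <a> whose href matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.matches = []
        self._href = None  # href of the currently open matching <a>, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.pattern.search(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.matches.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample; the first <td> is never closed, imitating sloppy markup.
SAMPLE = (
    '<table><tr><td class="altRow"><a href="/cabel">Abel, Christian</a>'
    '<td><a href="/jzhu">Zhu, Jie</a></tr></table>'
)

collector = HrefCollector(r"/cabel")
collector.feed(SAMPLE)
collector.close()
print(collector.matches)
```

Because the scan only reacts to `<a>` start and end tags, the unclosed `<td>` in the sample has no effect on which links are collected.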
