Using BeautifulSoup to find an HTML tag that contains certain text

情歌与酒 2020-11-28 08:12

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

this is cool #12345678901
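
To make the pattern concrete, here is a minimal sketch using only the standard `re` module. The second and third sample strings are invented for illustration and are not from the question.

```python
import re

# The pattern from the question: '#' followed by 11 non-whitespace characters.
pattern = re.compile(r'#\S{11}')

samples = [
    "this is cool #12345678901",  # 11 characters after '#': matches
    "this is nothing",            # no '#' at all: no match
    "too short #12345",           # only 5 characters after '#': no match
]

matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # [True, False, False]
```

Note that `search` also succeeds when more than 11 non-whitespace characters follow the `#`, since it only needs to find 11 of them.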

3 Answers
  • 2020-11-28 09:06

    With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
    # Tag name plus text=: matches <h2> tags whose .string fits the regex.
    soup('h2', text=re.compile(r' #\S{11}'))
    

    returns [<h2> this is cool #12345678901 </h2>].
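
One caveat may be worth sketching. When a tag name is combined with a text filter, bs4 tests the pattern against the tag's `.string`, which is `None` once the tag holds more than one child (for example, nested markup). The sample HTML and the helper name `h2_with_pattern` below are assumptions made up for illustration, and `string=` assumes Beautiful Soup 4.4 or later (earlier releases call it `text=`).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: the first <h2> contains a nested <b>, the second is plain text.
html = "<h2>this is <b>cool</b> #12345678901</h2><h2>plain #12345678901</h2>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"#\S{11}")

# Name + string=: the first <h2> has several children, so its .string is None
# and it is skipped; only the plain <h2> is returned.
by_string = soup.find_all("h2", string=pattern)

def h2_with_pattern(tag):
    """Match <h2> tags whose combined text (including nested tags) fits the pattern."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None

# A function filter sees every tag and can inspect the full text.
by_text = soup.find_all(h2_with_pattern)

print(len(by_string), len(by_text))  # 1 2
```

A function filter is slower than a plain string filter because it runs on every tag, but it is one way to catch headings that contain inline markup.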

  • 2020-11-28 09:11
    import re
    from bs4 import BeautifulSoup

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text, "html.parser")

    # With text= alone, the search yields the matching text nodes;
    # .parent is the tag that encloses each one.
    for elem in soup(text=re.compile(r' #\S{11}')):
        print(elem.parent)
    

    Prints:

    <h2>this is cool #12345678901</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
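
Because the loop above searches text nodes without naming a tag, it reaches matching strings inside any element, not only `h2`. If only `h2` elements are wanted, one possible sketch is to check the parent's name. This assumes bs4 4.4 or later for `string=`, and the shortened sample HTML is for illustration only.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Find every matching text node, then keep only those whose parent is an <h2>.
h2_tags = [s.parent for s in soup.find_all(string=pattern) if s.parent.name == "h2"]

result = [str(tag) for tag in h2_tags]
for line in result:
    print(line)
```

The `<h1>` text matches the pattern as well, but it is dropped by the `parent.name` check.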
    
  • 2020-11-28 09:11

    In BeautifulSoup 3 (the version imported below, run under Python 2), search operations return BeautifulSoup.NavigableString objects (or a list of them) when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is preferred over previous because of changes in BS4.

    from BeautifulSoup import BeautifulSoup
    from pprint import pprint
    import re
    
    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """
    
    soup = BeautifulSoup(html_text)
    
    # Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
    pattern = re.compile(r'cool')
    
    pprint(soup.find(text=pattern).__dict__)
    #>> {'next': u'\n',
    #>>  'nextSibling': None,
    #>>  'parent': <h2>this is cool #12345678901</h2>,
    #>>  'previous': <h2>this is cool #12345678901</h2>,
    #>>  'previousSibling': None}
    
    print soup.find('h2')
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern)
    #>> this is cool #12345678901
    print soup.find('h2', text=pattern).parent
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern) == soup.find('h2')
    #>> False
    print soup.find('h2', text=pattern) == soup.find('h2').text
    #>> True
    print soup.find('h2', text=pattern).parent == soup.find('h2')
    #>> True
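
For comparison, here is a rough sketch of the same checks under bs4 on Python 3, where combining a tag name with a text filter returns the tag itself instead of the string. It assumes bs4 4.4 or later, where the argument is spelled `string=`.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

# Text filter alone: the result is the matching text node.
text_node = soup.find(string=pattern)

# Tag name plus text filter: bs4 returns the <h2> tag, not the string.
tag = soup.find("h2", string=pattern)

print(type(text_node).__name__, type(tag).__name__)  # NavigableString Tag
print(text_node.parent is tag)                       # True
```

In both versions, `.parent` on the text node leads back to the enclosing tag, which is why it is the more portable attribute to rely on.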
    