How to select div by text content using Beautiful Soup?

轮回少年 2021-02-08 11:20

Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc.

Imagine everyone takes 3-5 classes. One of them is al

3 Answers
  • 2021-02-08 11:56

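    Another way, using a CSS selector:

    divs = soup.select('div:contains("Biology")')

    EDIT:

    This requires BeautifulSoup4 4.7.0+, which uses Soup Sieve for select(). Soup Sieve 2.1 and later spell the selector :-soup-contains() and deprecate the bare :contains() form.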
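As a minimal sketch of the selector approach, assuming Soup Sieve 2.1 or newer (where the selector is spelled `:-soup-contains()`): the sample markup and the `student` class below are made up for illustration, since the question's HTML is not shown in full. Note that the text match includes descendants, so an unqualified `div` selector also returns any wrapper div around a match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the question's page.
html = """
<div class="student">
  <div class="score">Biology A+</div>
  <div class="score">Chemistry B</div>
</div>
<div class="student">
  <div class="score">Chemistry C</div>
  <div class="score">Biology B</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches any div whose text, including descendants, contains "Biology",
# so the two wrapper divs are returned along with the two score divs.
all_matches = soup.select('div:-soup-contains("Biology")')

# Only considers text directly inside the element itself.
own_matches = soup.select('div:-soup-contains-own("Biology")')

# Narrowing by class avoids the wrappers as well.
divs = soup.select('div.score:-soup-contains("Biology")')
grades = [d.get_text().split()[-1] for d in divs]
print(grades)
```

Narrowing with a class or using `:-soup-contains-own()` is usually what you want when the page nests divs.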

  • 2021-02-08 12:02

    (1) To get only the Biology grades, it is almost a one-liner:

    import re
    import bs4

    # html holds the page markup from the question
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # string= matches text nodes (it was called text= before bs4 4.4.0)
    score_strings = soup.find_all(string=re.compile('Biology'))
    scores = [score_string.split()[-1] for score_string in score_strings]
    print(score_strings)
    print(scores)
    

    The output looks like this:

    ['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
    ['A+', 'B', 'B', 'B', 'B']
    

    (2) If you need the tags themselves for further work, take the parent of each matching string:

    import re
    import bs4

    soup = bs4.BeautifulSoup(html, 'html.parser')
    scores = soup.find_all(string=re.compile('Biology'))
    # Each match is a text node; its parent is the enclosing <div>.
    divs = [score.parent for score in scores]
    print(divs)
    

    Output looks like this:

    [<div class="score">Biology A+</div>, 
    <div class="score">Biology B</div>, 
    <div class="score">Biology B</div>, 
    <div class="score">Biology B</div>, 
    <div class="score">Biology B</div>]
    

    In conclusion, you can use .parent, find_parent(), find_next_sibling(), find_previous_sibling() and similar methods to move around the HTML tree.

    The Beautiful Soup documentation has more on navigating the tree. Good luck with your work.
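To make the tree navigation concrete, here is a small sketch that starts from the matching text node and walks to its parent, its enclosing block, and a sibling. The `student` and `name` classes and the names Ann and Bob are invented for the example; the question's real markup may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the Biology score sits at a different position
# in each block, as described in the question.
html = """
<div class="student">
  <div class="name">Ann</div>
  <div class="score">Biology A+</div>
  <div class="score">Chemistry B</div>
</div>
<div class="student">
  <div class="name">Bob</div>
  <div class="score">Chemistry C</div>
  <div class="score">Biology B</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

biology_by_name = {}
for node in soup.find_all(string=re.compile("Biology")):
    score_div = node.parent                       # the <div class="score">
    student = score_div.find_parent("div", class_="student")
    # Walk backwards through siblings until the name div is found.
    name_div = score_div.find_previous_sibling("div", class_="name")
    assert name_div.parent is student             # same enclosing block
    biology_by_name[name_div.get_text()] = node.split()[-1]

print(biology_by_name)
```

Because the search is by text rather than by index, it does not matter whether the Biology entry is the first or the last score in a block.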

  • 2021-02-08 12:12

    You can search for every <div> element whose class attribute is score, and use a regular expression to extract the Biology grade from its text:

    import re
    import sys

    from bs4 import BeautifulSoup

    # Parse the HTML file named on the command line.
    with open(sys.argv[1]) as f:
        soup = BeautifulSoup(f, 'html.parser')

    for div in soup.find_all('div', class_='score'):
        # get_text() works even when the div has child tags; div.string would be None there.
        t = re.search(r'Biology\s+(\S+)', div.get_text())
        if t:
            print(t.group(1))
    

    Run it like:

    python3 script.py htmlfile
    

    That yields:

    A+
    B
    B
    B
    B
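The same idea can be wrapped in a reusable function, sketched here under a few assumptions: `find_grades` is a hypothetical helper name, the sample markup is invented, and grades are taken to be the first whitespace-delimited token after the subject name. `re.escape` keeps subject names containing regex metacharacters from being misread as patterns.

```python
import re
from bs4 import BeautifulSoup


def find_grades(html, subject):
    """Return the grade following `subject` in every div.score (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(rf"{re.escape(subject)}\s+(\S+)")
    grades = []
    for div in soup.find_all("div", class_="score"):
        # get_text() flattens child tags, so <b>Biology</b> A+ still matches.
        match = pattern.search(div.get_text())
        if match:
            grades.append(match.group(1))
    return grades


# Made-up sample; the first score wraps the subject in a <b> tag.
sample = """
<div class="score"><b>Biology</b> A+</div>
<div class="score">Chemistry B</div>
<div class="score">Biology B</div>
"""

print(find_grades(sample, "Biology"))
print(find_grades(sample, "Physics"))
```

A subject that does not appear simply yields an empty list, which is easier to handle downstream than an exception.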
    