How to find spans with a specific class containing specific text using Beautiful Soup and re?

無奈伤痛 2021-02-01 07:16

How can I find all spans with a class of 'blue' that contain text in the format:

04/18/13 7:29pm

which could therefore be:

3 answers
  • 2021-02-01 07:30

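    This is a flexible regex that you can use. It accepts one- or two-digit months, days, and hours, two- to four-digit years, optional whitespace between the date and the time, and any capitalization of am/pm:

    "(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[aApP][mM])"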
    

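    Example (the regex checks only the shape of the text, so impossible values such as 15/18/2013 or 17:09aM still match):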

    >>> import re
    >>> from bs4 import BeautifulSoup
    >>> html = """
    <html>
    <body>
    <span class="blue">here is a lot of text that i don't need</span>
    <span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
    <span class="blue">04/19/13 7:30pm</span>
    <span class="blue">04/18/13 7:29pm</span>
    <span class="blue">Posted on 15/18/2013 10:00AM</span>
    <span class="blue">Posted on 04/20/13 10:31pm</span>
    <span class="blue">Posted on 4/1/2013 17:09aM</span>
    </body>
    </html>
    """
    >>> soup = BeautifulSoup(html, 'html.parser')
    >>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
    >>> ok = [m.group(1)
    ...       for line in lines
    ...       for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[aApP][mM])', line),)
    ...       if m]
    >>> ok
    ['04/18/13 7:29pm', '04/19/13 7:30pm', '04/18/13 7:29pm', '15/18/2013 10:00AM', '04/20/13 10:31pm', '4/1/2013 17:09aM']
    >>> for i in ok:
    ...     print(i)
    ...
    04/18/13 7:29pm
    04/19/13 7:30pm
    04/18/13 7:29pm
    15/18/2013 10:00AM
    04/20/13 10:31pm
    4/1/2013 17:09aM
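
The regex above checks only the shape of the stamp, so 15/18/2013 10:00AM and 4/1/2013 17:09aM both pass. One possible way to weed those out is to hand each candidate to datetime.strptime. This is a minimal sketch, not part of the answer: the helper name parse_stamp, the month/day/year order, the 12-hour clock, and the restriction to two- or four-digit years are all assumptions.

```python
import re
from datetime import datetime

# Same shape as the regex above, with date and time captured separately.
# Assumption of this sketch: years have either two or four digits.
STAMP = re.compile(r'(\d\d?/\d\d?/(?:\d{4}|\d{2}))\s*(\d\d?:\d\d[aApP][mM])')


def parse_stamp(text):
    """Return a datetime for the first stamp found in text, or None."""
    m = STAMP.search(text)
    if m is None:
        return None
    date_part, time_part = m.groups()
    year = date_part.rsplit('/', 1)[1]
    year_fmt = '%Y' if len(year) == 4 else '%y'
    try:
        # Assumes month/day/year order and a 12-hour clock.
        return datetime.strptime(
            f'{date_part} {time_part.upper()}',
            f'%m/%d/{year_fmt} %I:%M%p',
        )
    except ValueError:
        return None  # right shape, impossible value


print(parse_stamp('it contains 04/18/13 7:29pm'))
print(parse_stamp('Posted on 15/18/2013 10:00AM'))
```

Returning None for both "no stamp" and "impossible stamp" keeps the caller simple; raising an exception for the second case would be an equally reasonable choice.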
    
  • 2021-02-01 07:46

    This pattern seems to satisfy what you're looking for:

    >>> import re
    >>> pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
    >>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
    >>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
    ('04/18/13 7:29pm',)
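
pattern.match only tests the start of a single string, so the transcript above checks one span at a time. To scan a whole document with the same regex-only idea, something like the following might work. It is a sketch under assumptions: the name SPAN_DATE is made up here, [^<]*? replaces .*? so that a match cannot run across a tag boundary, and the am/pm part is narrowed to [ap]m. It still depends on the markup being written exactly as <span class="blue">, which is why the parser-based answers are usually more robust.

```python
import re

html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# [^<]*? skips leading text inside the span but stops at the next tag,
# so the date must sit at the very end of the same span.
SPAN_DATE = re.compile(
    r'<span class="blue">[^<]*?(\d\d/\d\d/\d\d \d\d?:\d\d[ap]m)</span>'
)

# findall returns the contents of the single capture group for each match.
dates = SPAN_DATE.findall(html)
for date in dates:
    print(date)
```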
    
  • 2021-02-01 07:54
    import re
    from bs4 import BeautifulSoup
    
    html_doc = """
    <html>
    <body>
    <span class="blue">here is a lot of text that i don't need</span>
    <span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
    <span class="blue">04/19/13 7:30pm</span>
    <span class="blue">Posted on 04/20/13 10:31pm</span>
    </body>
    </html>
    """
    
    # parse the html (naming the parser explicitly avoids a bs4 warning)
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # find all span elements with class "blue"
    spans = soup.find_all('span', {'class' : 'blue'})
    
    # create a list of lines corresponding to element texts
    lines = [span.get_text() for span in spans]
    
    # collect the dates from the list of lines using regex matching groups
    found_dates = []
    for line in lines:
        m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
        if m:
            found_dates.append(m.group(1))
    
    # print the dates we collected
    for date in found_dates:
        print(date)
    

    output:

    04/18/13 7:29pm
    04/19/13 7:30pm
    04/20/13 10:31pm
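
The question asks for the spans themselves rather than the date strings. A small variation on the code above keeps the matching elements instead of the extracted text. This is only a sketch: the names DATE and matching_spans are illustrative, and the pattern is the same two-digit-year format used in this answer.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

DATE = re.compile(r'\d{2}/\d{2}/\d{2} \d+:\d+[ap]m')

soup = BeautifulSoup(html_doc, 'html.parser')

# Keep each span of class "blue" whose text contains a date stamp.
matching_spans = [
    span
    for span in soup.find_all('span', class_='blue')
    if DATE.search(span.get_text())
]

for span in matching_spans:
    print(span)
```

Because the result is a list of Tag objects, attributes and neighbouring elements remain available for each hit.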
    