Find all the span styles with a font size larger than the most common one using Beautiful Soup in Python

刺人心 2021-01-28 10:35

I understand how to obtain the text from a specific div or span style from this question: How to find the most common span styles

Now the diff

2 Answers
  • 2021-01-28 11:28

    To find all the span styles with a font size larger than that of the most common span style using BeautifulSoup, you need to parse the CSS held in each span's style attribute.

    Parsing CSS is better done using a library such as cssutils. This would then let you access the fontSize attribute directly.

    The fontSize value is a string such as 12px, which does not sort correctly as plain text: a plain string sort places "1px" after "18px". To get around this, you could use a library such as natsort.
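
The sorting problem described above, and one way around it without a third-party library, can be sketched as follows. The helper name `px_value` is hypothetical (it is not part of natsort or cssutils), and the sketch assumes every size is written in whole pixels, such as `12px`.

```python
import re

def px_value(size):
    """Return the leading integer of a size string such as '12px'.

    Hypothetical helper for illustration; assumes whole-pixel sizes.
    """
    match = re.match(r"\s*(\d+)px\s*$", size)
    if match is None:
        raise ValueError("not a whole-pixel size: {!r}".format(size))
    return int(match.group(1))

sizes = ["12px", "14px", "1px", "18px"]

# A plain string sort compares character by character, so "1px" lands last
print(sorted(sizes))                # ['12px', '14px', '18px', '1px']

# Sorting on the numeric value gives the expected order
print(sorted(sizes, key=px_value))  # ['1px', '12px', '14px', '18px']
```

natsort handles this case generically; an explicit numeric key like this is only an alternative when all sizes share one unit.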

    So, first parse each of the styles into a CSS object. At the same time, keep a list that pairs each span's soup element with the parsed CSS for its style.

    Now use the fontSize attribute as the sort key for natsort. This gives a list of styles sorted by font size, largest first (by using reverse=True). takewhile() then collects entries from the start of that list until it reaches one whose size matches the most common size, which leaves only the entries larger than the most common one.

    from bs4 import BeautifulSoup
    from collections import Counter
    from itertools import takewhile    
    import cssutils
    import natsort
    
    html = """
        <span style="font-family: ArialMT; font-size:12px">1</span>
        <span style="font-family: ArialMT; font-size:14px">2</span>
        <span style="font-family: ArialMT; font-size:1px">3</span>
        <span style="font-family: Arial; font-size:12px">4</span>
        <span style="font-family: ArialMT; font-size:18px">5</span>
        <span style="font-family: ArialMT; font-size:15px">6</span>
        <span style="font-family: ArialMT; font-size:12px">7</span>
        """
    
    soup = BeautifulSoup(html, "html.parser")    
    style_counts = Counter()
    parsed_css_style = []       # Holds list of tuples (css_style, span)
    
    for span in soup.find_all('span', style=True):
        style_counts[span['style']] += 1
        parsed_css_style.append((cssutils.parseStyle(span['style']), span))
    
    most_common_style = style_counts.most_common(1)[0][0]
    most_common_css_style = cssutils.parseStyle(most_common_style)
    css_styles = natsort.natsorted(parsed_css_style, key=lambda x: x[0].fontSize, reverse=True)
    
    print("Styles larger than most common font size of {} are:".format(most_common_css_style.fontSize))
    
    for css_style, span in takewhile(lambda x: x[0].fontSize != most_common_css_style.fontSize, css_styles):
        print("  Font size: {:5}  Text: {}".format(css_style.fontSize, span.text))
    

    In the example shown, the most commonly used font size is 12px, and 3 entries are larger than this:

    Styles larger than most common font size of 12px are:
      Font size: 18px   Text: 5
      Font size: 15px   Text: 6
      Font size: 14px   Text: 2
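
The takewhile() step used in this answer can be seen in isolation in the minimal sketch below. It uses plain integers instead of parsed CSS objects, and the value of `most_common` is assumed to have been found beforehand (for example with Counter).

```python
from itertools import takewhile

# Font sizes in px, already sorted largest first
sizes_desc = [18, 15, 14, 12, 12, 12, 1]
most_common = 12  # assumed to have been found with Counter beforehand

# takewhile yields items until the predicate first fails, then stops for good
larger = list(takewhile(lambda size: size != most_common, sizes_desc))
print(larger)  # [18, 15, 14]
```

This only works because the list is sorted in descending order. On an unsorted list such as [1, 18, 12, 15], the same call would return [1, 18], so a plain comparison (size > most_common) is the safer choice when the order is not guaranteed.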
    

    To install you will probably need:

    pip install natsort
    pip install cssutils    
    

    Note that this assumes the font sizes on your website are specified consistently. It compares only the numerical value and cannot compare different font metrics (for example, sizes given in different units).
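
If a page does mix units, one possible workaround is to normalise each size to px before comparing. The sketch below is not part of the answer's code: the helper `to_px` and the table `PX_PER_UNIT` are hypothetical names, and the em/rem factor assumes a 16px base font size, which may not match the real page.

```python
import re

# Conversion factors to px. The absolute units follow the CSS definitions
# (1in = 96px, 1pt = 1/72in). The em/rem factor ASSUMES a 16px base font
# size, which is only a common browser default.
PX_PER_UNIT = {"px": 1.0, "pt": 96.0 / 72.0, "in": 96.0, "em": 16.0, "rem": 16.0}

def to_px(size):
    """Convert a size such as '12pt' to px, or return None if unsupported.

    Hypothetical helper for illustration only.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)(px|pt|in|em|rem)\s*$", size)
    if match is None:
        return None
    number, unit = match.groups()
    return float(number) * PX_PER_UNIT[unit]

# Sizes in mixed units can now be ordered on a common scale
mixed = ["2em", "14px", "12pt"]
print(sorted(mixed, key=to_px))  # ['14px', '12pt', '2em']
```

Keyword sizes such as `large` and percentages are not handled here: `to_px` returns None for them, so they would need to be filtered out before sorting.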

  • 2021-01-28 11:29

    This may help you:

        from bs4 import BeautifulSoup
        from collections import Counter
        import re

        # Assumes `html` already holds the page markup as a string
        soup = BeautifulSoup(html, "html.parser")

        usedFontSize = []  # list of all font sizes used (in px)

        # Find all the spans that have a style attribute
        spans = soup.find_all('span', style=True)
        for span in spans:
            styleTag = span['style']
            # Raw string so that \d reaches the regex engine unchanged
            fontSize = re.findall(r"font-size:\s*(\d+)px", styleTag)
            if fontSize:  # skip spans whose style has no px font size
                usedFontSize.append(int(fontSize[0]))

        # Find the most commonly used font size
        count = Counter(usedFontSize)
        # Print every font size with its number of occurrences
        print(count.most_common())
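
The snippet in this answer stops at counting the font sizes. A minimal sketch of the remaining step, picking out the sizes larger than the most common one, might look like this. The function name `larger_than_most_common` is hypothetical, and the sketch works on plain style strings so that it runs without BeautifulSoup.

```python
from collections import Counter
import re

def larger_than_most_common(styles):
    """Return the distinct px font sizes larger than the most common one.

    `styles` is a list of style attribute strings. Hypothetical helper;
    assumes whole-pixel sizes and at least one style with a font size.
    """
    sizes = []
    for style in styles:
        found = re.findall(r"font-size:\s*(\d+)px", style)
        if found:
            sizes.append(int(found[0]))
    most_common_size = Counter(sizes).most_common(1)[0][0]
    # A set removes duplicates; sort largest first for display
    return sorted({s for s in sizes if s > most_common_size}, reverse=True)

styles = [
    "font-family: ArialMT; font-size:12px",
    "font-family: ArialMT; font-size:14px",
    "font-family: ArialMT; font-size:1px",
    "font-family: Arial; font-size:12px",
    "font-family: ArialMT; font-size:18px",
    "font-family: ArialMT; font-size:15px",
    "font-family: ArialMT; font-size:12px",
]
print(larger_than_most_common(styles))  # [18, 15, 14]
```

With a real page, the input list could be built as `[span['style'] for span in soup.find_all('span', style=True)]`.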
    