Get contents by class names using Beautiful Soup

有刺的猬 2020-12-28 19:43

Using the Beautiful Soup module, how can I get the data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

soup.cla         


        
6 Answers
  • 2020-12-28 19:58

    Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304

    As you can see, Beautiful Soup does not treat class="a b" as two separate classes a and b.

    However, as the first comment there points out, a simple regular expression should suffice. In your case:

    import re
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

    soup = BeautifulSoup(html_doc)
    for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
        print "result: ", x
    

    Note: This has been fixed in the recent beta. I haven't gone through the docs of the recent versions, so you may want to check them. If you need it to work with the older version, you can use the code above.
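
    To illustrate the note above: Beautiful Soup 4 parses the class attribute as a list of values, so searching for a single class name finds the tag even when it carries several classes. The following is only a minimal sketch, assuming bs4 is installed; the html_doc string is made up for illustration and is not from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from the original page
html_doc = """
<div class="feeditemcontent cxfeeditemcontent"><span>first</span></div>
<div class="other"><span>second</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ is compared against each individual class value on the tag
matches = soup.find_all("div", class_="feeditemcontent")
for tag in matches:
    # tag["class"] is a list of class names in bs4
    print(tag["class"], tag.get_text(strip=True))
```

    Only the first div should be returned here, since the second one does not carry the feeditemcontent class.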

  • 2020-12-28 19:59

    soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

    So, if I want to get all div tags with class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be something like:

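    from bs4 import BeautifulSoup as bs
    import requests

    url = "http://stackoverflow.com/"
    html = requests.get(url).text
    soup = bs(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning

    # all <div> tags whose class list includes "header"
    tags = soup.find_all("div", class_="header")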
    

    This is already covered in the bs4 documentation.
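
    One caveat worth checking against the bs4 documentation: a class_ string containing two class names is compared with the exact value of the class attribute, so the order of the names matters. A CSS selector, by contrast, matches tags that carry both classes in any order. A short sketch, assuming a bs4 version with select() support; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, written in different orders
html = """
<div class="feeditemcontent cxfeeditemcontent">a</div>
<div class="cxfeeditemcontent feeditemcontent">b</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: only the first div matches
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# CSS selector: both divs carry both classes, so both match
both = soup.select("div.feeditemcontent.cxfeeditemcontent")

print(len(exact), len(both))
```

    If the order of class names on the page is not guaranteed, the selector form is probably the safer choice.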

  • 2020-12-28 20:06
    soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})
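
    The question asks for the data inside the div. Once the tag is found, get_text() can pull out its text. This is a small sketch rather than a definitive recipe: it assumes bs4 is installed and that the page resembles the sample markup below, which is shortened from another answer on this page.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration
html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembody"><span>The actual data is some where here</span></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", {"class": "feeditemcontent cxfeeditemcontent"})

# find() returns None when nothing matches, so guard before reading the text
text = div.get_text(strip=True) if div is not None else None
print(text)
```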
    
  • 2020-12-28 20:12
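    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

    # Parse a local file, then match on an exact attribute value via attrs
    f = open('a.htm')
    soup = BeautifulSoup(f)
    f.close()
    divs = soup.findAll('div', attrs={'id': 'abc def'})  # renamed to avoid shadowing the builtin list
    print divs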
    
  • 2020-12-28 20:19

    Try this. It may be overkill for such a simple task, but it works:

    def match_class(target):
        target = target.split()
        def do_match(tag):
            try:
                classes = dict(tag.attrs)["class"]
            except KeyError:
                classes = ""
            classes = classes.split()
            return all(c in classes for c in target)
        return do_match
    
    html = """<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
    <div class="feeditembody">
    <span>The actual data is some where here</span>
    </div>
    </div>
    </div>"""
    
    from BeautifulSoup import BeautifulSoup
    
    soup = BeautifulSoup(html)
    
    matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
    for m in matches:
        print m
        print "-"*10
    
    matches = soup.findAll(match_class("feeditembody"))
    for m in matches:
        print m
        print "-"*10
    
  • 2020-12-28 20:22

    Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, so jadkik94's match_class solution can be simplified:

    from bs4 import BeautifulSoup

    def match_class(target):
        def do_match(tag):
            # bs4 returns the class attribute as a list of names
            classes = tag.get('class', [])
            return all(c in classes for c in target)
        return do_match

    # html is the sample document from the previous answer
    soup = BeautifulSoup(html, "html.parser")
    print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))
    