Using the Beautiful Soup module, how can I get the data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:
soup.cla
Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304
As you can see, Beautiful Soup cannot really understand class="a b" as two classes, a and b.
However, as the first comment there suggests, a simple regexp should suffice. In your case:
import re

soup = BeautifulSoup(html_doc)
for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
    print "result: ", x
Note: This has been fixed in the recent beta. I haven't gone through the docs for the recent versions, so you may want to check them. If you need this to work with the older version, use the approach above.
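To see what the \b word boundaries in that regexp do and do not guard against, here is a small sketch that uses only the re module. The sample class strings (including the hyphenated feeditemcontent-wide) and the has_class helper are made up for illustration; they are not taken from the bug report.

```python
import re

# The same pattern as in the answer above.
pattern = re.compile(r"\bfeeditemcontent\b")

# Hypothetical values of a class attribute.
samples = [
    "feeditemcontent cxfeeditemcontent",  # the target class is present
    "cxfeeditemcontent",                  # only the longer, different class
    "feeditemcontent-wide",               # a different, hyphenated class name
]

# \b stops the pattern from matching inside "cxfeeditemcontent", but a
# hyphen is a non-word character, so the hyphenated name still matches.
regex_hits = [bool(pattern.search(s)) for s in samples]


def has_class(value, name):
    """Stricter check: compare whole whitespace-separated class tokens."""
    return name in value.split()


token_hits = [has_class(s, "feeditemcontent") for s in samples]

print(regex_hits)
print(token_hits)
```

The regexp rejects cxfeeditemcontent on its own but still accepts the hyphenated name, because a hyphen counts as a word boundary. Comparing whitespace-separated tokens avoids that, at the cost of needing a function instead of a pattern.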
soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")
So, if I want to get all div tags of class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be something like:
from bs4 import BeautifulSoup as bs
import requests
url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html, "html.parser")  # name the parser explicitly so bs4 does not have to guess one
tags = soup.findAll("div", class_="header")
This is already covered in the bs4 documentation.
soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})
from BeautifulSoup import BeautifulSoup
f = open('a.htm')
soup = BeautifulSoup(f)
divs = soup.findAll('div', attrs={'id': 'abc def'})  # renamed so the built-in list is not shadowed
print divs
Try this. It may be overkill for such a simple task, but it works:
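def match_class(target):
    # Build a predicate that matches tags carrying every class named in target.
    target = target.split()
    def do_match(tag):
        try:
            # In BeautifulSoup 3, tag.attrs is a list of (name, value) pairs.
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match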
html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10
matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10
Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, meaning jadkik94's solution can be simplified:
from bs4 import BeautifulSoup

def match_class(target):
    def do_match(tag):
        # bs4 returns the class attribute as a list of class names
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match

soup = BeautifulSoup(html, "html.parser")
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))
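Because Beautiful Soup 4 exposes class as a list, its built-in search methods can often do this matching without a custom function. The following is a minimal sketch, assuming bs4 is installed and using the standard library's html.parser. The three-div sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in both orders, plus one
# div that carries only the first class.
html = """
<div class="feeditemcontent cxfeeditemcontent"><span>first</span></div>
<div class="cxfeeditemcontent feeditemcontent"><span>second</span></div>
<div class="feeditemcontent"><span>third</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every div carrying BOTH classes, in any order.
both = soup.select("div.feeditemcontent.cxfeeditemcontent")

# A single class name matches any div whose class list contains it.
either = soup.find_all("div", class_="feeditemcontent")

# A multi-word string is compared against the exact attribute value,
# so the div with the classes in the other order is not matched.
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

both_text = [d.get_text() for d in both]
either_text = [d.get_text() for d in either]
exact_text = [d.get_text() for d in exact]

print(both_text)
print(either_text)
print(exact_text)
```

The selector form is usually the most convenient when the order of the class names in the markup is not guaranteed.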