Question
I have several thousand HTML pages and I am trying to extract the text from them.
I am doing this with Beautiful Soup, but get_text() gives me too much unnecessary information from these pages.
Therefore I wrote a loop:
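from bs4 import BeautifulSoup as bs

# text5 is defined elsewhere; each item is parsed as one HTML document
l = []
for line in text5:
    soup = bs(line, 'html.parser')
    # Join the text of every <p> tag with a space, then drop newlines
    p_text = ' '.join(p.text for p in soup.find_all('p'))
    k = p_text.replace('\n', '')
    l.append(k)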
But this loop gives me the content of every tag that starts with <p.
For example, I want everything between two plain <p> tags, but I also get the content of something like this:
<p class="header-main__label"> bla ba </p>
Can I tell BeautifulSoup to get just the plain <p> tags?
Answer 1:
You can set class and id to False, and it will return only the tags that have neither a class nor an id attribute:
soup.find_all('p', {'class': False, 'id': False})
or (the keyword argument class_ ends with _ because class is a reserved keyword in Python)
soup.find_all('p', class_=False, id=False)
from bs4 import BeautifulSoup as BS

text = '<p class="A">text A</p> <p>text B</p> <p id="C">text C</p>'
soup = BS(text, 'html.parser')

# Variant 1: pass the attribute filters as a dictionary
all_items = soup.find_all('p', {'class': False, 'id': False})
for item in all_items:
    print(item.text)

# Variant 2: pass the attribute filters as keyword arguments
all_items = soup.find_all('p', class_=False, id=False)
for item in all_items:
    print(item.text)
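To connect this back to the loop in the question, here is a minimal sketch of how the class_=False, id=False filter might be applied to several pages at once. The function name extract_plain_paragraphs and the sample pages list are hypothetical and only serve as an illustration; the joining and newline handling mirror the question's loop.

```python
from bs4 import BeautifulSoup

def extract_plain_paragraphs(html):
    """Return the joined text of <p> tags that have neither class nor id."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p', class_=False, id=False)
    # Join with a space and drop newlines, as in the question's loop
    return ' '.join(p.text for p in paragraphs).replace('\n', '')

# Hypothetical input: one HTML string per page
pages = [
    '<p class="header-main__label"> bla ba </p><p>first</p><p>second</p>',
    '<p id="C">text C</p><p>text B</p>',
]

texts = [extract_plain_paragraphs(page) for page in pages]
print(texts)  # ['first second', 'text B']
```

Note that a tag such as <p data="D"> would still be matched here, because only class and id are checked.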
EDIT: If you want tags without any attributes at all, you can filter the items using not item.attrs:
for item in all_items:
    if not item.attrs:
        print(item.text)
from bs4 import BeautifulSoup as BS

text = '<p class="A">text A</p> <p>text B</p> <p id="C">text C</p> <p data="D">text D</p>'
soup = BS(text, 'html.parser')

# Find every <p> tag, then keep only those with an empty attribute dictionary
all_items = soup.find_all('p')
for item in all_items:
    if not item.attrs:
        print(item.text)
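As a possible variation on the not item.attrs check, find_all also accepts a function that is called with each tag, so the filter can be moved into the search itself. This is only a sketch; the helper name is_bare_paragraph is made up for the example.

```python
from bs4 import BeautifulSoup

def is_bare_paragraph(tag):
    """True for <p> tags that carry no attributes at all."""
    return tag.name == 'p' and not tag.attrs

text = '<p class="A">text A</p> <p>text B</p> <p id="C">text C</p> <p data="D">text D</p>'
soup = BeautifulSoup(text, 'html.parser')

# find_all calls is_bare_paragraph for every tag and keeps those returning True
bare_texts = [item.text for item in soup.find_all(is_bare_paragraph)]
print(bare_texts)  # ['text B']
```

The result is the same as filtering after the search; which form reads better is mostly a matter of taste.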
Source: https://stackoverflow.com/questions/58747286/is-it-possible-to-just-get-the-tags-without-a-class-or-id-with-beautifulsoup