Is it possible to just get the tags without a class or id with BeautifulSoup?

Question

I have several thousand HTML pages and I am trying to extract the text from them.

I am doing this with Beautiful Soup, but get_text() gives me too much unnecessary information from these pages.

Therefore I wrote a loop:

from bs4 import BeautifulSoup as bs

l = []
for line in text5:
    soup = bs(line, 'html.parser')
    # join the text of every <p> tag, then drop the newlines
    p_text = ' '.join(p.text for p in soup.find_all('p'))
    k = p_text.replace('\n', '')
    l.append(k)

But this loop returns the content of every tag that starts with <p.

For example:

I want everything between two plain <p> tags, but I also get the content of something like this:

<p class="header-main__label"> bla ba </p>.

Can I tell BeautifulSoup to just get the plain <p> tags?

Answer 1:

You can set class and id to False, and find_all() will then return only the tags that have neither a class nor an id:

soup.find_all('p', {'class': False, 'id': False})

or (the keyword argument is spelled class_ with a trailing underscore because class is a reserved word in Python)

soup.find_all('p', class_=False, id=False)

from bs4 import BeautifulSoup as BS

text = '<p class="A">text A</p>  <p>text B</p>  <p id="C">text C</p>'

soup = BS(text, 'html.parser')

# variant 1: filters passed as a dictionary

all_items = soup.find_all('p', {'class': False, 'id': False})

for item in all_items:
    print(item.text)

# variant 2: the same filters passed as keyword arguments

all_items = soup.find_all('p', class_=False, id=False)

for item in all_items:
    print(item.text)
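
As a possible alternative to the two find_all() variants above, the same filter can be written as a CSS selector with select(). This is only a sketch: it assumes a Beautiful Soup version that bundles the soupsieve selector engine (4.7 or later), and the names html, plain_paragraphs and texts are made up for the illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="A">text A</p>  <p>text B</p>  <p id="C">text C</p>'
soup = BeautifulSoup(html, 'html.parser')

# :not([class]) skips tags that carry a class attribute,
# :not([id]) does the same for id.
plain_paragraphs = soup.select('p:not([class]):not([id])')

texts = [p.get_text() for p in plain_paragraphs]
print(texts)  # ['text B']
```

Whether the selector or the False filters read better is mostly a matter of taste; both select the same tags here.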

EDIT: If you want tags without any attributes at all, you can filter the items with not item.attrs:

for item in all_items:
    if not item.attrs:
        print(item.text)

from bs4 import BeautifulSoup as BS

text = '<p class="A">text A</p> <p>text B</p> <p id="C">text C</p> <p data="D">text D</p>'

soup = BS(text, 'html.parser')

all_items = soup.find_all('p')

for item in all_items:
    if not item.attrs:
        print(item.text)
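
To tie this back to the loop in the question, here is a minimal sketch that applies the not item.attrs check to several pages at once. The helper plain_paragraph_text and the sample pages list are hypothetical, and passing a function to find_all() is just one way to express the filter; the if check shown above works equally well.

```python
from bs4 import BeautifulSoup


def plain_paragraph_text(html):
    """Join the text of all <p> tags that carry no attributes (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all() accepts a function that receives each tag and returns True to keep it
    plain = soup.find_all(lambda tag: tag.name == 'p' and not tag.attrs)
    return ' '.join(p.get_text().replace('\n', '') for p in plain)


# Sample input standing in for the pages from the question
pages = [
    '<p class="header-main__label"> bla ba </p><p>first paragraph</p>',
    '<p>second</p><p data="D">text D</p><p>third</p>',
]

results = [plain_paragraph_text(page) for page in pages]
print(results)  # ['first paragraph', 'second third']
```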

Source: https://stackoverflow.com/questions/58747286/is-it-possible-to-just-get-the-tags-without-a-class-or-id-with-beautifulsoup

Tags

python

beautifulsoup