Python/BeautifulSoup - how to remove all tags from an element?

执念已碎 2020-11-28 06:01

How can I simply strip all tags from an element I find in BeautifulSoup?

7 answers
  • 2020-11-28 06:25

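    This fetches a page and prints exactly the text it contains, with every tag removed:

    import bs4
    import requests

    URL = ''  # address of the page to fetch
    page = requests.get(URL)
    # get_text() drops every tag and returns only the text
    text = bs4.BeautifulSoup(page.content, 'html.parser').get_text()
    print(text)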
    
  • 2020-11-28 06:32

    This looks like the way to do it, and it is as simple as that.

    This line joins together all the text pieces within the current element. Note that it needs find_all, not find, because find returns only the first piece:

    ''.join(htmlelement.find_all(string=True))
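
    As a minimal sketch of what that join produces, the snippet below runs it on a small made-up fragment (the `markup` string and the variable names are illustrative, not from the question). It assumes bs4 4.4.0 or later, where the filter is spelled `string=True`; older releases spell it `text=True`.

```python
from bs4 import BeautifulSoup

# Illustrative markup; any element with nested tags would do
markup = '<p>I linked to <i>example.com</i> today</p>'
element = BeautifulSoup(markup, "html.parser").find("p")

# find_all(string=True) returns every text node beneath the element, in order
pieces = element.find_all(string=True)
joined = "".join(pieces)

print(list(pieces))
print(joined)
```

    For markup without comments, as here, the joined string matches what `element.get_text()` returns.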
    
  • 2020-11-28 06:38

    With BeautifulStoneSoup gone in bs4, it's even simpler in Python 3:

    from bs4 import BeautifulSoup
    
    # html is a string holding the markup; naming the parser avoids a bs4 warning
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    print(text)
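
    By default get_text() concatenates neighbouring text nodes with nothing between them, which can run words together. A small sketch of the `separator` and `strip` arguments follows; the list markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup with two adjacent elements
markup = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

# Default: text nodes are concatenated directly
plain = soup.get_text()

# separator is placed between text nodes; strip=True trims each node first
spaced = soup.get_text(separator=" ", strip=True)

print(repr(plain))
print(repr(spaced))
```

    Whether a space, a newline, or nothing is the right separator depends on how the text will be used afterwards.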
    
  • 2020-11-28 06:38

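    You can use the decompose method in bs4. Note that it removes each nested tag together with its contents, so the text inside those tags is dropped as well:

    import bs4

    soup = bs4.BeautifulSoup('<body><a href="http://example.com/">I linked to <i>example.com</i></a></body>', 'html.parser')

    # Copy the children into a list first, because decompose() mutates the tree
    for a in list(soup.find('a').children):
        if isinstance(a, bs4.element.Tag):
            a.decompose()

    print(soup)

    Out: <body><a href="http://example.com/">I linked to </a></body>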
    
  • 2020-11-28 06:40

    Why has no answer I've seen mentioned the unwrap method? Or, even easier, the get_text method?

    http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unwrap
    http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
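
    A minimal sketch of the unwrap approach, reusing the link markup from the decompose answer: unwrap() replaces a tag with its own contents, so the nested tags disappear while their text stays in the tree. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.find("a")

# find_all(True) matches every tag nested inside the link;
# unwrap() removes each tag but keeps its children in place
for tag in link.find_all(True):
    tag.unwrap()

print(link)
print(link.get_text())
```

    This keeps the element itself in the tree, which suits editing a document in place; when only the text is needed, get_text() alone is enough.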

  • 2020-11-28 06:47

    Use get_text(). It returns all the text in a document or beneath a tag as a single Unicode string.

    For instance, remove all the tags from the following markup:

    <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
    <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
    </td>
    

    The expected result is:

    Signal et Communication
    Ingénierie Réseaux et Télécommunications
    

    Here is the source code:

    #!/usr/bin/env python3
    from bs4 import BeautifulSoup
    
    text = '''
    <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
    <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
    </td>
    '''
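    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(text, 'html.parser')

    # strip() drops the blank lines left over from the markup's own line breaks
    print(soup.get_text().strip())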
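
    Since the question asks about an element found in the soup rather than the whole document, here is a sketch that calls get_text() on tags returned by find and find_all, using the same table cell. The names `cell` and `link_texts` are illustrative.

```python
from bs4 import BeautifulSoup

markup = '''
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
'''
soup = BeautifulSoup(markup, "html.parser")

# Locate one element, then extract text only from the tags beneath it
cell = soup.find("td")
link_texts = [a.get_text(strip=True) for a in cell.find_all("a")]

print(link_texts)
```

    Collecting the text per tag gives a list that is easier to process further than one string split on newlines.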
    