How to remove content in nested tags with BeautifulSoup?

青春惊慌失措 2021-01-20 15:34

How can I remove the content in nested tags with BeautifulSoup? These posts show the reverse, how to retrieve the content in nested tags: How to get contents of nested ta

3 Answers
  • 2021-01-20 15:58

    Here is my simple method: soup.body.clear() or soup.tag.clear(), where tag is the name of the tag you want to empty (for example, soup.table.clear()).
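
    A minimal sketch of what clear() does to a nested tag. The sample markup and the html.parser choice are assumptions made for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: text mixed with one nested tag.
html = "<foo>Something <bar>blah blah</bar> else</foo>"
soup = BeautifulSoup(html, "html.parser")

# clear() removes everything inside <bar> but keeps the (now empty) tag itself.
soup.bar.clear()
result = str(soup)
print(result)
```

    Note that the empty <bar></bar> element stays in the tree; only its contents are gone.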

    Say you want to clear the content of <table></table> and add a new pandas DataFrame. Later you can use the same clear() method to update the tables in an HTML file for your webpage without using Flask or Django:

        import pandas as pd
        import bs4
    

    I want to convert a 1.2-million-row .csv into a DataFrame, then into an HTML table, and then add it to my webpage's HTML. Later I want to update the data whenever the CSV changes by simply switching a variable:

        bizcsv = pd.read_csv("business.csv")  # read_csv already returns a DataFrame
        dframe = pd.DataFrame(bizcsv)
        dfhtml = dframe.to_html()  # convert the DataFrame to an HTML <table> string
        # keep only the markup between the outer <table ...> and </table> tags
        dfhtml_update = dfhtml[dfhtml.index(">") + 1:dfhtml.rindex("</table>")]
        # use dfhtml_update later to update your table without the <table> tags;
        # the <table> is easy for BS to select & clear!
    
        #A small function to unescape (&lt; to <) the tags back into HTML format
        def unescape(s):
            s = s.replace("&lt;", "<")
            s = s.replace("&gt;", ">")
            # this has to be last:
            s = s.replace("&amp;", "&")
            return s
    
        with open("page.html") as page:  # return to here when updating
            txt = page.read()
        soup = bs4.BeautifulSoup(txt, features="lxml")
        soup.body.append(dfhtml)  # adds the table to <body> as text, so its tags get escaped
        with open("page.html", "w") as outf:
            outf.write(unescape(str(soup)))  # unescape restores the tags when writing page.html
    
        # Say you want to make seamless table updates to your webpage instead of
        # using Flask or Django: re-read page.html as above so soup.table is a parsed tag, then
        soup.table.clear()  # clears everything in <table></table>
        soup.table.append(dfhtml_update)
        with open("page.html", "w") as outf:
            outf.write(unescape(str(soup)))
    

    I'm a newbie, but after a lot of searching I combined a few fundamentals from the documentation. It's somewhat bloated, but so is working with literally billions of cells of data. This works for me.
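
    As a possible alternative to appending a string and unescaping the whole page afterwards, the new table markup could be parsed first and swapped in as a real tag. This is only a sketch: new_table_html is a hypothetical stand-in for the string returned by DataFrame.to_html(), page_html is a stand-in for the contents of page.html, and html.parser is used so that no extra parser is required:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the string returned by DataFrame.to_html().
new_table_html = '<table border="1" class="dataframe"><tr><td>new</td></tr></table>'

# Hypothetical stand-in for the contents of page.html.
page_html = "<html><body><h1>Report</h1><table><tr><td>old</td></tr></table></body></html>"

soup = BeautifulSoup(page_html, "html.parser")

# Parse the new markup so it is inserted as tags rather than as escaped text.
new_table = BeautifulSoup(new_table_html, "html.parser").table

# Swap the old <table> for the freshly parsed one.
soup.table.replace_with(new_table)
updated = str(soup)
print(updated)
```

    Because the table is inserted as a parsed tag, nothing is escaped on output, so other entities in the page (such as a literal &amp;) are left untouched.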

  • 2021-01-20 15:59

    For example:

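    from bs4 import BeautifulSoup as bs

    body = bs(html, "html.parser")  # html is the markup string to clean
    for tag in body.find_all('bar'):
        tag.replace_with('')  # swap each <bar> element for an empty string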
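
    A variant of the same idea, sketched with decompose(), which removes a tag and its contents from the tree outright. The sample markup is borrowed from another answer in this thread; limiting the search to direct children with recursive=False is an assumption made here to avoid touching tags that were already removed along with their parent:

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches any tag; recursive=False restricts it to direct children of <foo>.
for tag in soup.foo.find_all(True, recursive=False):
    tag.decompose()  # drop the tag together with everything nested inside it

text = soup.foo.get_text()
print(text)
```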
    
  • 2021-01-20 16:02

    You can check for bs4.element.NavigableString on children:

    from bs4 import BeautifulSoup as bs
    import bs4
    html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
    def get_only_text(elem):
        # yield only the strings that are direct children of elem
        for item in elem.children:
            if isinstance(item, bs4.element.NavigableString):
                yield item
    
    print(''.join(get_only_text(bs(html, "html.parser").find_all('foo')[0])))
    

    Output:

    Something something  something  else
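
    The same filtering can probably be written without a helper function by asking find_all() for strings only and turning off recursion. This is a sketch that assumes a bs4 version recent enough to accept the string= argument, and it uses html.parser:

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
foo = BeautifulSoup(html, "html.parser").foo

# string=True matches text nodes; recursive=False keeps only direct children of <foo>,
# so text nested inside <bar> and <bar2> is skipped.
direct_text = "".join(foo.find_all(string=True, recursive=False))
print(direct_text)
```

    Unlike clear() or decompose(), this leaves the parsed tree unchanged and only selects which text to read.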
    