How to remove content in nested tags with BeautifulSoup?
These posts showed the reverse, retrieving the content in nested tags: How to get contents of nested ta
Here is my simple method: soup.body.clear(), or soup.tag.clear() for any other tag (for example soup.table.clear()).
Let's say you want to clear the content in <table></table> and add a new pandas DataFrame. Later you can use this clear method to easily update the tables in an HTML file for your webpage instead of using flask/django:
    import pandas as pd
    import bs4

    # I want to convert a 1.2 million row .csv into a DataFrame, then into an
    # HTML table, and then add it to my webpage's HTML. Later I want to update
    # the data whenever the csv changes, simply by switching a variable.
    dframe = pd.read_csv("business.csv")  # read_csv already returns a DataFrame
    dfhtml = dframe.to_html()  # convert DataFrame to table, HTML format
    # Keep only the markup between the outer <table ...> and </table> tags.
    # (str.strip() removes a set of characters, not a substring, so slice instead.)
    dfhtml_update = dfhtml[dfhtml.index(">") + 1 : dfhtml.rindex("</table>")]
    # Use dfhtml_update later to update your table without the <table> tags;
    # the <table> is easy for BS to select & clear!
    # A small function to unescape (&lt; to <) the tags back into HTML format
    def unescape(s):
        s = s.replace("&lt;", "<")
        s = s.replace("&gt;", ">")
        # this has to be last:
        s = s.replace("&amp;", "&")
        return s
    with open("page.html") as page:  # return to here when updating
        txt = page.read()
    soup = bs4.BeautifulSoup(txt, features="lxml")
    soup.body.append(dfhtml)  # adds table to <body> (as a string, hence unescape)
    with open("page.html", "w") as outf:
        outf.write(unescape(str(soup)))  # writes to page.html

    # Let's say you want to make seamless table updates to your webpage instead
    # of using flask or django x_x: re-read page.html into soup (the first three
    # lines above) so the table is parsed as a real tag, then:
    soup.table.clear()  # clears everything in <table></table>
    soup.table.append(dfhtml_update)
    with open("page.html", "w") as outf:
        outf.write(unescape(str(soup)))
I'm a newbie, but after tons of searching I just combined a few fundamentals from the documentation. It's kind of bloated, but so is working with literally billions of cells of data. This works for me.
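A possible alternative to appending the table as a string and unescaping afterwards, sketched under assumptions: parse the new table markup into its own soup and move its nodes into the existing <table>. The names page_html, new_table_html and replace_table_rows are hypothetical, and the hand-written table only stands in for what DataFrame.to_html() would return.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins: a page that already contains a table, and fresh
# table markup of the kind DataFrame.to_html() produces.
page_html = "<html><body><h1>Data</h1><table><tr><td>old</td></tr></table></body></html>"
new_table_html = '<table border="1" class="dataframe"><tr><td>new</td></tr></table>'


def replace_table_rows(page_html, new_table_html):
    """Empty the page's first <table> and refill it from new_table_html."""
    soup = BeautifulSoup(page_html, "html.parser")
    fragment = BeautifulSoup(new_table_html, "html.parser")
    soup.table.clear()  # drop everything between <table> and </table>
    # Copy the child list first: appending a node moves it out of `fragment`.
    for child in list(fragment.table.children):
        soup.table.append(child)
    return str(soup)


updated = replace_table_rows(page_html, new_table_html)
print(updated)
```

Because the new rows are inserted as parsed tags rather than as a plain string, they are not escaped on output, so no unescape step should be needed in this variant.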
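E.g.:
    from bs4 import BeautifulSoup as bs

    body = bs(html)  # html is the markup string you want to clean
    for tag in body.find_all('bar'):
        tag.replace_with('')  # swap the whole <bar> element for an empty string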
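As a hedged variation on the loop above, Tag.decompose() removes an element together with everything nested inside it, without leaving an empty string node behind. This sketch assumes the sample markup used elsewhere in this thread and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

soup = BeautifulSoup(html, "html.parser")
# Remove every <bar> and <bar2> element, nested content included.
for tag in soup.find_all(["bar", "bar2"]):
    tag.decompose()

remaining = str(soup)
print(remaining)
```

Use clear() instead when the tag itself should stay in the document and only its contents should go.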
You can check for bs4.element.NavigableString on children:
    from bs4 import BeautifulSoup as bs
    import bs4

    html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

    def get_only_text(elem):
        # Yield only the text nodes that are direct children of elem.
        for item in elem.children:
            if isinstance(item, bs4.element.NavigableString):
                yield item

    print(''.join(get_only_text(bs(html, "html.parser").find_all('foo')[0])))
Output:
Something something  something  else
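The same filtering can probably be written more compactly with find_all(string=True, recursive=False), which returns only the text nodes that are direct children of a tag. This is a sketch on the same sample markup; the whitespace cleanup at the end is an added assumption about the desired output, not part of the answer above.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

foo = BeautifulSoup(html, "html.parser").find("foo")
# string=True matches text nodes; recursive=False limits the search to
# direct children, so the text inside <bar> and <bar2> is skipped.
direct_text = "".join(foo.find_all(string=True, recursive=False))
# Collapse the runs of whitespace left where the nested tags used to be.
clean_text = " ".join(direct_text.split())
print(clean_text)
```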