I'm trying to extract some text using BeautifulSoup. I'm using the get_text() function for this purpose.
My problem is that the text contain
Adding to Ian's and dividebyzero's post/comments you can do this to efficiently filter/replace many tags in one go:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
A regex should do the trick.
import re
# Raw string; \s*/? also matches the self-closing forms <br/> and <br />
s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
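To see what a pattern like this does on the common spellings of the tag, here is a minimal sketch. The sample string and the name BR_PATTERN are made up for illustration, and the case-insensitive flag is an extra assumption (to also catch <BR>), not part of the answer above.

```python
import re

# Hypothetical sample input standing in for yourTextHere.
html = "line one<br>line two<br/>line three<br />line four<BR>line five"

# \s* allows optional whitespace before the slash, /? allows the
# self-closing slash, and IGNORECASE also matches an upper-case <BR>.
BR_PATTERN = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

text = BR_PATTERN.sub("\n", html)
print(text)
```

This only works on the raw markup string, before it is parsed, and it will not handle br tags that carry attributes.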
Hope this helps!
As the official doc says:
You can specify a string to be used to join the bits of text together: soup.get_text("\n")
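A small sketch of the difference the separator makes, using a made-up HTML snippet and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = "<div>first line<br>second line<br/>third line</div>"
soup = BeautifulSoup(html, "html.parser")

# Without a separator the text pieces are glued together.
plain = soup.get_text()

# With "\n" as separator, each piece of text ends up on its own line.
joined = soup.get_text("\n")

print(repr(plain))
print(repr(joined))
```

Note that the separator is placed between every pair of text nodes, not only where a br tag was, so inline tags such as a or b will also introduce line breaks.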
If you call element.text, you'll get the text without the br tags.
Maybe you need to define your own custom method for this purpose:
def clean_text(elem):
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            # text node: keep its content without surrounding whitespace
            text += e.strip()
        elif e.name == 'br' or e.name == 'p':
            # line-breaking tag: emit a newline
            text += '\n'
    return text
# get page content
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))
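For trying the idea without a live request, here is a self-contained sketch of the same approach on an inline HTML string. The function name clean_text_joined, the sample markup, and the choice to collect pieces in a list and join them at the end are my own assumptions, not part of the answer above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def clean_text_joined(elem):
    """Collect text nodes, emitting a newline for each <br> or <p> tag."""
    parts = []
    for node in elem.descendants:
        if isinstance(node, NavigableString):
            # text node: drop surrounding whitespace
            parts.append(node.strip())
        elif isinstance(node, Tag) and node.name in ("br", "p"):
            # line-breaking tag: emit a newline where the tag opens
            parts.append("\n")
    return "".join(parts)


# Hypothetical sample markup, for illustration only.
html = '<div class="description-class">Intro<br>Second line<p>A paragraph</p></div>'
soup = BeautifulSoup(html, "html.parser")
result = clean_text_joined(soup.find("div", class_="description-class"))
print(result)
```

The newline for a p tag is emitted when the tag is first reached, so it separates the paragraph from whatever came before it rather than from what follows.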
Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.
To steal the list from @petezurich:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
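A quick sketch of the append approach on a made-up snippet, with a shorter tag list chosen for the example. Because the tags stay in the tree, nested elements keep their own line breaks when get_text() is called afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = "<div><h3>Title</h3><p>First paragraph</p><p>Second<br>line</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Add a newline as the last child of every tag that should end a line.
for elem in soup.find_all(["p", "h3", "br"]):
    elem.append("\n")

text = soup.get_text()
print(text)
```

Leaving container tags such as div out of the list here is only to keep the output free of extra blank lines; which tags matter depends on the page.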
You can do this using the BeautifulSoup object itself, or any element of it:
for br in soup.find_all("br"):
    br.replace_with("\n")