I'm using BeautifulSoup (version '4.3.2' with Python 3.4) to convert HTML documents to text. The problem I'm having is that sometimes web pages have newline characters
get_text might be helpful here:
>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc, "html.parser")  # name the parser explicitly so results do not depend on what is installed
>>> soup.get_text(separator="\n")
'This is a paragraph.\nThis is another paragraph.'
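The question is cut off, so the following is only a sketch under an assumption: that the unwanted newlines sit inside the HTML source of each paragraph. The helper name html_to_text is hypothetical, and it only looks at <p> elements. It collapses whitespace inside each paragraph and then puts one paragraph per line:

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return one line per <p>, with whitespace inside each paragraph collapsed.

    Hypothetical helper for illustration; it ignores text outside <p> tags.
    """
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for p in soup.find_all("p"):
        # get_text() keeps newlines from the source markup; split() + join()
        # collapses every run of whitespace (including "\n") to one space.
        lines.append(" ".join(p.get_text().split()))
    return "\n".join(lines)


doc = "<p>This is a\nparagraph.</p>\n<p>This is another\n  paragraph.</p>"
print(html_to_text(doc))
# This is a paragraph.
# This is another paragraph.
```

If other block-level tags matter (headings, list items), the find_all call would need to include them as well.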
I would take a look at python-markdownify. It turns HTML into readable text in Markdown format.
It is available on PyPI: https://pypi.python.org/pypi/markdownify/0.4.0
and on GitHub: https://github.com/matthewwithanm/python-markdownify