Make BeautifulSoup handle line breaks as a browser would

I'm using BeautifulSoup (version 4.3.2 with Python 3.4) to convert HTML documents to text. The problem I'm having is that sometimes web pages have newline characters

2 Answers

囚心锁ツ

2021-01-17 21:43

get_text might be helpful here:

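>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> # Name the parser explicitly to avoid the "no parser specified" warning
>>> soup = BeautifulSoup(doc, "html.parser")
>>> # Join the text of each element with a newline
>>> soup.get_text(separator="\n")
'This is a paragraph.\nThis is another paragraph.'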
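
Note that get_text keeps newline characters from the HTML source as they are, whereas a browser collapses them into single spaces and only breaks lines at <br> and block-level elements. A rough sketch of that behaviour follows. The helper name browser_like_text and the BLOCK_TAGS list are assumptions made for illustration, not part of BeautifulSoup. The tag list is incomplete, and blank lines (for example from two consecutive <br> tags) are dropped for simplicity.

```python
import re

from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: a small, incomplete set of tags treated as block-level.
BLOCK_TAGS = {"p", "div", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents a browser does not render as text.
SKIP_TAGS = {"script", "style"}


def _walk(node, out):
    """Append text pieces and line-break markers for node's children to out."""
    for child in node.children:
        if isinstance(child, Tag):
            if child.name in SKIP_TAGS:
                continue
            if child.name == "br":
                out.append("\n")
                continue
            is_block = child.name in BLOCK_TAGS
            if is_block:
                out.append("\n")
            _walk(child, out)
            if is_block:
                out.append("\n")
        elif type(child) is NavigableString:
            # Plain text only (comments and similar subclasses are skipped).
            # Source newlines and runs of whitespace become a single space.
            out.append(re.sub(r"\s+", " ", child))


def browser_like_text(html):
    """Hypothetical helper: approximate the text a browser would display."""
    soup = BeautifulSoup(html, "html.parser")
    out = []
    _walk(soup, out)
    lines = []
    for line in "".join(out).split("\n"):
        line = re.sub(r" +", " ", line).strip()
        if line:  # simplification: empty lines are dropped
            lines.append(line)
    return "\n".join(lines)


html = "<p>This is\na paragraph.</p>\n<p>Line one<br>line\n   two</p>"
print(browser_like_text(html))
# This is a paragraph.
# Line one
# line two
```

The newline inside the first paragraph becomes a space, while the <br> and the paragraph boundaries become real line breaks. Elements such as <pre>, where whitespace is significant, are not handled by this sketch.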

不要未来只要你来

2021-01-17 21:45

I would take a look at python-markdownify. It turns HTML into fairly readable text in Markdown format.

It is available on PyPI: https://pypi.python.org/pypi/markdownify/0.4.0

and on GitHub: https://github.com/matthewwithanm/python-markdownify
