Make BeautifulSoup handle line breaks as a browser would

一向 2021-01-17 21:06

I'm using BeautifulSoup (version '4.3.2' with Python 3.4) to convert HTML documents to text. The problem I'm having is that sometimes web pages have newline characters

2 Answers
  • 2021-01-17 21:43

    get_text might be helpful here:

    >>> from bs4 import BeautifulSoup
    >>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
    >>> soup = BeautifulSoup(doc, "html.parser")  # name the parser explicitly
    >>> soup.get_text(separator="\n")
    'This is a paragraph.\nThis is another paragraph.'
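
Note that get_text keeps newline characters that sit inside a text node, whereas a browser would render them as a single space. A minimal sketch of browser-like handling follows. The helper name html_to_text and the BLOCK_TAGS set are assumptions made for illustration (they are not part of BeautifulSoup), and the tag list is deliberately incomplete:

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: a small, incomplete set of block-level tags, for illustration only.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


def html_to_text(html):
    """Hypothetical helper: approximate how a browser lays out whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []

    def walk(node):
        for child in node.children:
            if isinstance(child, Tag):
                if child.name in ("script", "style"):
                    continue  # never rendered as text
                if child.name == "br":
                    parts.append("\n")  # explicit line break
                elif child.name in BLOCK_TAGS:
                    parts.append("\n")  # block elements start and end a line
                    walk(child)
                    parts.append("\n")
                else:
                    walk(child)  # inline element: no break
            elif type(child) is NavigableString:
                # Plain text only (comments, doctypes, CDATA are skipped).
                # Runs of HTML whitespace, newlines included, become one space.
                parts.append(re.sub(r"[ \t\r\n\f]+", " ", child))

    walk(soup)
    text = "".join(parts)
    text = re.sub(r" +", " ", text)           # spaces merged across node boundaries
    text = re.sub(r" *\n[ \n]*", "\n", text)  # trim around breaks, drop blank lines
    return text.strip()


print(html_to_text("<p>This is a\nparagraph.</p><p>Line one<br>line two</p>"))
```

Here the newline inside the first paragraph is treated as a space, while the br tag and the paragraph boundaries produce line breaks. Elements such as pre, tables, and CSS-driven layout are not handled by this sketch.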
    
  • I would take a look at python-markdownify. It turns HTML into fairly readable text in Markdown format.

    It is available on PyPI: https://pypi.python.org/pypi/markdownify/0.4.0

    and on GitHub: https://github.com/matthewwithanm/python-markdownify
