Identifying large bodies of text via BeautifulSoup or other python based extractors

Backend · unresolved · 2 answers · 363 views
独厮守ぢ 2021-01-31 06:25

Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on

2 Answers
  • 2021-01-31 07:06

    You're really not going about it the right way, I would say, as all the comments above would attest to.

    That said, this does what you're looking for.

    from bs4 import BeautifulSoup as BS
    import requests
    html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
    soup = BS(html, 'html.parser')
    print('\n\n'.join(k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')))
    

    It pulls out only the text: first it finds the main container of all the <p> tags, then it selects the <p> tags themselves to get the text, ignoring <script> and other irrelevant elements.

    As was mentioned in the comments, this will only work for CNN, and possibly only this page. You might need a different strategy for every new webpage.
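One rough way to generalize the strategy beyond a single site is to score every container by how much paragraph text it directly holds and keep the winner. This is only a sketch, not the answerer's method: the helper name and the sample HTML below are made up for illustration.

```python
from bs4 import BeautifulSoup

def largest_text_block(html):
    """Return the joined text of the container whose direct <p> children
    hold the most text. A naive heuristic, not production-ready."""
    soup = BeautifulSoup(html, 'html.parser')
    best, best_len = None, 0
    for tag in soup.find_all(True):
        # Count only direct <p> children, so an ancestor like <body>
        # does not trivially outscore the real article container.
        length = sum(len(p.get_text()) for p in tag.find_all('p', recursive=False))
        if length > best_len:
            best, best_len = tag, length
    if best is None:
        return ''
    return '\n\n'.join(p.get_text() for p in best.find_all('p', recursive=False))

# Made-up page: a navigation stub plus one real article container.
html = """
<html><body>
  <div id="nav"><p>Home</p></div>
  <div id="story">
    <p>First paragraph of the article body with plenty of text.</p>
    <p>Second paragraph, also fairly long compared to the navigation.</p>
  </div>
</body></html>
"""
print(largest_text_block(html))
```

This still breaks on pages that split an article across several containers or that render text with JavaScript, which is why the other answer's suggestion of a purpose-built extractor is usually the better route.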

  • 2021-01-31 07:25

    You might look at the python-readability package which does exactly this for you.
