Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on
I'd say you're not going about this the right way, as the comments above attest.
That said, this does what you're looking for:
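from bs4 import BeautifulSoup as BS
import requests

# Fetch the page and parse it, naming the parser explicitly
html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
soup = BS(html, 'html.parser')
# Join the text of every <p> inside the article's main container
print('\n\n'.join(k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')))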
It pulls out only the text, first by finding the main container of all the <p> tags, then by selecting only the <p> tags themselves to get the text, ignoring the <script> and other irrelevant ones.
As was mentioned in the comments, this will only work for CNN, and possibly only for this page. You might need a different strategy for every new webpage.
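If you want something less tied to one site's class names, one rough heuristic for "the largest body of text" is to group the <p> tags by the element that directly contains them and keep the group with the most text. The sketch below is only that: it uses the standard library's html.parser instead of BeautifulSoup, the names ParagraphGrouper and largest_text_block are made up for illustration, and it assumes the article's paragraphs share one parent element, which will not hold on every page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose contents are code or styling, not readable text.
SKIP_TAGS = {"script", "style"}


class ParagraphGrouper(HTMLParser):
    """Collect <p> text, grouped by the element directly containing each <p>.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, unique id) for every open element
        self.counter = 0       # source of unique element ids
        self.groups = {}       # container id -> list of paragraph strings
        self.pieces = None     # text fragments of the <p> being read, if any
        self.container = None  # id of the element holding the current <p>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p":
            if self.pieces is not None:
                self.handle_endtag("p")  # an unclosed <p> ends at the next <p>
            self.container = self.stack[-1][1] if self.stack else 0
            self.pieces = []
        self.counter += 1
        self.stack.append((tag, self.counter))

    def handle_endtag(self, tag):
        if tag not in [t for t, _ in self.stack]:
            return  # stray closing tag: ignore it
        # Pop back to the most recent matching open tag.
        while self.stack.pop()[0] != tag:
            pass
        # If no <p> is open any more, the paragraph being read is finished.
        if self.pieces is not None and all(t != "p" for t, _ in self.stack):
            text = " ".join("".join(self.pieces).split())
            if text:
                self.groups.setdefault(self.container, []).append(text)
            self.pieces = None

    def handle_data(self, data):
        in_skipped = bool(self.stack) and self.stack[-1][0] in SKIP_TAGS
        if self.pieces is not None and not in_skipped:
            self.pieces.append(data)


def largest_text_block(html):
    """Return the paragraphs of the container holding the most <p> text."""
    parser = ParagraphGrouper()
    parser.feed(html)
    parser.close()
    if parser.pieces is not None:
        parser.handle_endtag("p")  # flush a <p> left open at end of input
    if not parser.groups:
        return ""
    best = max(parser.groups.values(),
               key=lambda paras: sum(len(p) for p in paras))
    return "\n\n".join(best)
```

You would call it as largest_text_block(requests.get(url).text). Expect it to pick the wrong block on pages where a comment section or sidebar holds more paragraph text than the article itself.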
You might look at the python-readability package (published on PyPI as readability-lxml), which does exactly this for you.