Identifying large bodies of text via BeautifulSoup or other Python-based extractors


Given some random news article, I want to write a web crawler to find the largest body of text present, and extract it. The intention is to extract the physical news article on


2 Answers

误落风尘

2021-01-31 07:06
I'd say you're not going about this the right way, as the comments above attest.

That said, this does what you're looking for.
```
from bs4 import BeautifulSoup
import requests

url = 'http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
# Find the container that holds the story, then keep only its <p> tags.
container = soup.find(class_='cnn_strycntntlft')
print('\n\n'.join(p.get_text() for p in container.find_all('p')))
```
It pulls out only the text: it first finds the main container that holds all the <p> tags, then selects only the <p> tags themselves to get their text, ignoring <script> and other irrelevant tags.

As was mentioned in the comments, this will only work for CNN, and possibly only for this page. You might need a different strategy for every new webpage.
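If you want something less tied to one site's class names, one rough heuristic is to group every <p> under the element that directly contains it and keep the group with the most text. The sketch below does that with only the standard library's html.parser. The names ParagraphCollector and largest_text_block are made up for illustration, and it assumes paragraphs are explicitly closed with </p>; real pages with sloppy markup may need something sturdier.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParagraphCollector(HTMLParser):
    """Group the text of each <p> under the element that directly contains it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # (tag, element_id) for every open element
        self.counter = 0      # used to give each element a unique id
        self.by_parent = {}   # parent element_id -> list of paragraph strings
        self.p_parent = None  # parent id of the <p> being read, if any
        self.chunks = []      # text pieces of the <p> being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p" and self.p_parent is None:
            self.p_parent = self.stack[-1][1] if self.stack else -1
            self.chunks = []
        self.counter += 1
        self.stack.append((tag, self.counter))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        else:
            return
        if tag == "p" and self.p_parent is not None:
            text = " ".join("".join(self.chunks).split())
            if text:
                self.by_parent.setdefault(self.p_parent, []).append(text)
            self.p_parent = None

    def handle_data(self, data):
        inside_code = bool(self.stack) and self.stack[-1][0] in ("script", "style")
        if self.p_parent is not None and not inside_code:
            self.chunks.append(data)


def largest_text_block(html):
    """Return the paragraphs of the container holding the most paragraph text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if not parser.by_parent:
        return ""
    best = max(parser.by_parent.values(),
               key=lambda paragraphs: sum(len(p) for p in paragraphs))
    return "\n\n".join(best)
```

You would call it as largest_text_block(requests.get(url).text). Scoring by total character count is a deliberately crude choice; it will pick a long comment thread or footer over a short article, so treat it as a starting point rather than a general solution.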
闹比i

2021-01-31 07:25

You might look at the python-readability package which does exactly this for you.
