Crawl only content from multiple different websites

Submitted by Deadly on 2020-01-05 04:37:22

Question


Currently I am working on a project where I want to analyze articles from different blogs, magazines, etc. that are published online on their websites.

I have therefore already built a web crawler in Python, which fetches every new article as HTML.

Now here is the point: I want to analyze only the pure content (just the article, without comments, recommendations, etc.), but I can't access this content without defining a regular expression to extract it from the HTML response. Writing a regular expression for each source is not an option, because I have around 100 different sources for the articles.

I have tried the html2text library to extract the content, but it only transforms the raw HTML to Markdown, so things like comments and recommendations remain and would have to be removed manually.

Any thoughts on how I can approach this problem?


Answer 1:


In order to get the main article text and ignore extraneous text, you'd have to write code for specific webpages or devise some heuristics to identify and extract article content.

Luckily there are existing libraries that address this problem.

Newspaper is a Python 3 library:

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()   # fetch the HTML
article.parse()      # required before accessing article.text
print(article.text)
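
Since you have around 100 sources, the same extraction step can simply run in a loop over all article URLs. Here is a minimal sketch (not from the original answer; the URL list is a placeholder) with basic error handling:

from newspaper import Article

urls = [
    'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/',
    # ...one entry per crawled article
]

for url in urls:
    article = Article(url)
    try:
        article.download()
        article.parse()              # must run before .text is populated
    except Exception as exc:         # newspaper raises ArticleException on failure
        print('skipping %s: %s' % (url, exc))
        continue
    print(article.title)             # extracted headline
    print(article.text[:200])        # first 200 characters of the article body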

You may also want to check out similar libraries such as python-readability or python-goose:

from goose import Goose   # for the maintained Python 3 fork: from goose3 import Goose

url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
g = Goose()
article = g.extract(url=url)
print(article.cleaned_text)   # main article text with boilerplate stripped
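
For completeness, here is a minimal sketch with python-readability (the readability-lxml package, mentioned above but not shown in the original answer). It is useful in your case because it works directly on HTML your crawler has already fetched:

import requests
from readability import Document

url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
html = requests.get(url).text        # or reuse the HTML your own crawler fetched

doc = Document(html)
print(doc.title())                   # extracted headline
print(doc.summary())                 # main article content as cleaned HTML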


Source: https://stackoverflow.com/questions/55714934/crawl-only-content-from-multiple-different-websites
