Create Great Parser - Extract Relevant Text From HTML/Blogs

Anirvan

Boy, do I have the perfect solution for you.

Arc90's readability algorithm does exactly this. Given HTML content, it picks out the content of the main blog post text, ignoring headers, footers, navigation, etc.

There are implementations in several languages. I'll be releasing a Perl port to CPAN in a couple of days. (Done.)
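If you happen to be in Python, one of the ports can be used along these lines. This is a minimal sketch assuming the readability-lxml package; the package name and the example URL are my assumptions, not something specified above, so check whichever port you actually install:

```python
# Minimal sketch using a Python port of Arc90's readability algorithm
# (assumed here to be the readability-lxml package).
import requests
from readability import Document

html = requests.get("https://example.com/some-blog-post").text  # placeholder URL

doc = Document(html)
print(doc.title())          # best-guess title of the post
main_html = doc.summary()   # HTML fragment containing just the main article body
```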

Hope this helps!

There are projects out there that specifically look at filtering out the 'noise' of a given page. Typically this is done by giving the algorithm a few example pages of a given type, so it can work out which parts don't change between them. That said, you'd have to give it a few example pages/posts from every blog you wanted to parse. This usually works well when you have a small, defined set of sites you'll be crawling (news sites, for instance). The algorithm is essentially detecting the HTML template the site uses and picking out the interesting parts. There's no magic here; it's tough and imperfect.

A great example of this algorithm can be found in the EveryBlock.com source code, which was just open-sourced. Go to everyblock.com/code, download the "ebdata" package, and look at the "templatemaker" module.
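To illustrate the general idea (this is a simplified sketch of the technique, not templatemaker's actual API): compare two pages rendered from the same template, treat the chunks they share as template, and treat the chunks that differ as the per-post content.

```python
# Simplified sketch of template detection, not templatemaker's real API:
# chunks shared by both pages are treated as template, chunks unique to
# the first page are (roughly) its content.
import difflib
import re

def tokenize(html):
    # Split on tags so template boilerplate and text line up as comparable chunks.
    return re.split(r"(<[^>]+>)", html)

def extract_changing_parts(page_a, page_b):
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(None, a, b)
    changing = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":                        # shared chunks are the template
            changing.append("".join(a[i1:i2]))   # differing chunks are page_a's content
    return changing

page1 = "<html><h1>My Blog</h1><p>First post body</p><div>footer</div></html>"
page2 = "<html><h1>My Blog</h1><p>Second post, different text</p><div>footer</div></html>"
print(extract_changing_parts(page1, page2))      # -> ['First post body']
```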

And I don't mean to state the obvious, but have you considered just using the RSS feeds from the blogs in question? Usually the feed entries contain the entire blog post, title, and other metadata. Using RSS is going to be far simpler than the template-detection approach I mentioned above.
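For example, a minimal sketch with Python's feedparser library (the feed URL is a placeholder):

```python
# Minimal sketch: pulling post title and body straight from a blog's RSS/Atom feed
# with the feedparser library. The feed URL below is a placeholder.
import feedparser

feed = feedparser.parse("https://example-blog.com/feed")
for entry in feed.entries:
    title = entry.get("title", "")
    # Full post bodies often live in entry.content; otherwise fall back to the summary.
    if "content" in entry:
        body = entry.content[0].value
    else:
        body = entry.get("summary", "")
    print(title, len(body))
```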
