Options for HTML scraping? [closed]


难免孤独


30 Answers

南笙

2020-11-22 04:25

I've used Beautiful Soup a lot with Python. It is much better than matching with regular expressions, because it gives you a DOM-like tree even when the HTML is poorly formed. You can find tags and text quickly, with simpler syntax than a regular expression. Once you have an element, you can iterate over it and its children, which makes the content far easier to work with in code. I wish Beautiful Soup had existed years ago, when I had to do a lot of screen scraping. It would have saved me a lot of time and headaches, since HTML structure was so poor before people started validating it.
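
To make that concrete, here is a minimal sketch, assuming the `bs4` package is installed and using Python's built-in `html.parser` backend. The markup, the `links` id, and the `note` class are invented for illustration. The HTML is deliberately sloppy (unclosed `<li>` and `<p>` tags, an unquoted attribute) to show that lookups still work.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately malformed markup: the <li> and <p> tags are
# never closed and the id attribute is unquoted.
html = """
<html><body>
<ul id=links>
  <li><a href="/first">First</a>
  <li><a href="/second">Second</a>
</ul>
<p class="note">Unclosed paragraph
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find one element by tag name and attribute, then walk its descendants.
ul = soup.find("ul", id="links")
links = [(a.get_text(strip=True), a["href"]) for a in ul.find_all("a")]

# Pull the text of an element selected by CSS class.
note = soup.find("p", class_="note").get_text(strip=True)

for text, href in links:
    print(text, href)
print(note)
```

The same extraction with regular expressions would need separate patterns for the anchors, their attributes, and the unclosed paragraph, and would break as soon as the attribute order or quoting changed.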

萌比男神i

2020-11-22 04:25

I've also had great success using Aptana's Jaxer + jQuery to parse pages. It's not as fast or 'script-like' in nature, but jQuery selectors + real JavaScript/DOM is a lifesaver on more complicated (or malformed) pages.

悲哀的现实

2020-11-22 04:27
Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:
- mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
- lxml: Python binding to libxml2 and libxslt. Supports various ways to traverse and select elements (e.g. XPath and CSS selectors).
- scrapemark: high-level library that uses templates to extract information from HTML.
- pyquery: lets you make jQuery-like queries on XML and HTML documents.
- scrapy: a high-level scraping and web crawling framework. It can be used to write spiders for data mining, monitoring, and automated testing.
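
As an illustration of one of these, here is a minimal sketch of XPath selection with lxml, assuming the `lxml` package is installed. The markup and the `item` and `price` class names are invented for the example.

```python
from lxml import html

# Hypothetical page fragment; in practice this string would come from an
# HTTP response body.
page = html.fromstring("""
<html><body>
  <div class="item"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="item"><h2>Gadget</h2><span class="price">19.50</span></div>
</body></html>
""")

items = []
# Select every item block, then run relative XPath queries inside each one.
for div in page.xpath('//div[@class="item"]'):
    name = div.xpath('string(./h2)')
    price = float(div.xpath('string(./span[@class="price"])'))
    items.append((name, price))

print(items)
```

Running the relative queries from each `div` keeps a name paired with its own price, which is harder to guarantee when two flat lists are matched up afterwards.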
夕颜

2020-11-22 04:33

'Simple HTML DOM Parser' is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, you will find yourself at home.

Find it here

There is also a blog post about it here.

粉色の甜心

2020-11-22 04:33

I have used LWP and HTML::TreeBuilder with Perl and have found them very useful.

LWP (short for libwww-perl) lets you connect to websites and fetch the HTML. You can get the module here, and the O'Reilly book seems to be online here.

TreeBuilder allows you to construct a tree from the HTML, and documentation and source are available in HTML::TreeBuilder - Parser that builds a HTML syntax tree.

This approach may still leave too much heavy lifting to do, though. I have not looked at the Mechanize module suggested by another answer, so I may well do that.

我寻月下人不归

2020-11-22 04:34

There is this solution too: Netty HttpClient
