Which web crawler for extracting and parsing data from about a thousand web sites

庸人自扰 2021-02-06 15:45

I'm trying to crawl about a thousand web sites, and I'm interested in the HTML content only.

Then I transform the HTML into XML to be parsed with XPath to ex

3 answers

清歌不尽 (original poster)

2021-02-06 16:15

I would not use the 2.x branch (which has been discontinued) or the 3.x branch (currently in development) for any 'serious' crawling unless you want to help improve Heritrix or just like being on the bleeding edge.

Heritrix 1.14.3 is the most recent stable release and it really is stable, used by many institutions for both small and large scale crawling. I'm using it to run crawls against tens of thousands of domains, collecting tens of millions of URLs in under a week.

The 3.x branch is getting closer to a stable release, but even then I'd wait a bit for general use at The Internet Archive and others to improve its performance and stability.

Update: Since someone up-voted this recently I feel it is worth noting that Heritrix 3.x is now stable and is the recommended version for those starting out with Heritrix.
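
As a side note on the step after crawling that the question mentions (turning HTML into XML and querying it with XPath), here is a rough sketch that is not tied to Heritrix in any way. It uses only the Python standard library. `HtmlToTree`, `html_to_tree`, `extract_links` and the synthetic `document` root element are names made up for illustration, and ElementTree supports only a limited subset of XPath, so treat this as a starting point rather than a recommended pipeline.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlToTree(HTMLParser):
    """Builds an ElementTree from (possibly sloppy) HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Synthetic root so that fragments and stray tags still have a parent.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive as None.
        attrib = {k: (v or "") for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the element, do not open it.
        ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text that follows a closed child belongs to that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_tree(html):
    """Parse an HTML string and return the synthetic root element."""
    parser = HtmlToTree()
    parser.feed(html)
    parser.close()
    return parser.root


def extract_links(html):
    """Return the href of every <a> element that has one."""
    tree = html_to_tree(html)
    return [a.get("href") for a in tree.findall(".//a[@href]")]
```

Calling `extract_links(page_source)` on a fetched page returns its link targets, and `ET.tostring(html_to_tree(page_source))` gives the XML serialization if you want to hand it to another tool. For full XPath 1.0 support you would probably want a third-party parser instead of ElementTree.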
