I want to prevent automated HTML scraping from one of our sites while not affecting legitimate spidering (Googlebot, etc.). Is there something that already exists to accomplish this?
This is difficult, if not impossible, to accomplish. Many "rogue" spiders/crawlers do not identify themselves via the user agent string, so there is no reliable way to single them out. You can try to block them by IP address, but it is hard to keep up with adding new addresses to your block list, and IP-based blocking can also lock out legitimate users, since proxies make many different clients appear to come from a single IP address.
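Just to illustrate why this is brittle, here is a minimal sketch of the user-agent/IP filtering approach, assuming a Flask app; the lists and addresses are made up for illustration, not a recommendation:

```python
# Minimal sketch: filter requests by user agent substring and IP address.
# BLOCKED_IPS and BLOCKED_AGENT_SUBSTRINGS are hypothetical, hand-maintained
# lists; in practice they go stale quickly, and proxies hide many clients
# behind one IP, which is exactly the problem described above.
from flask import Flask, request, abort

app = Flask(__name__)

BLOCKED_IPS = {"203.0.113.7", "198.51.100.23"}        # example addresses only
BLOCKED_AGENT_SUBSTRINGS = ("badbot", "scrapy")       # example substrings only

@app.before_request
def block_rogue_clients():
    ip = request.remote_addr
    agent = (request.headers.get("User-Agent") or "").lower()
    if ip in BLOCKED_IPS or any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
        # Refuse the request. Note that everyone behind a blocked proxy IP
        # is refused too, including legitimate users.
        abort(403)
```

A scraper that rotates IP addresses or spoofs a browser user agent sails straight through a filter like this, which is why the list-maintenance burden never really ends.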
The problem with using robots.txt in this situation is that the spider can just choose to ignore it.
EDIT: Rate limiting is a possibility, but it suffers from some of the same problems of identifying (and keeping track of) "good" and "bad" user agents/IPs. In a system we wrote for internal page view/session counting, we eliminate sessions based on page view rate, but we don't bother filtering out "good" spiders, since we don't want them counted in the data either. We do nothing to prevent any client from actually viewing the pages.
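For what it's worth, a page-view-rate check can be as simple as a sliding-window counter per client. The sketch below is a hypothetical illustration (the threshold and window are invented), not the system described above:

```python
# Hypothetical sliding-window rate check per client (IP or session id).
# A client exceeding MAX_VIEWS page views within WINDOW_SECONDS is flagged;
# what you do with the flag (drop the session from your stats, throttle,
# or block) is a separate decision.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # assumed window size
MAX_VIEWS = 120        # assumed threshold: ~2 pages/second sustained

_views = defaultdict(deque)   # client id -> timestamps of recent page views

def record_view(client_id, now=None):
    """Record one page view; return True if the client is over the limit."""
    now = time.time() if now is None else now
    q = _views[client_id]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_VIEWS
```

The same limitation applies here: you still have to decide which clients you trust, so well-behaved crawlers such as Googlebot either get exempted or, as in our case, simply excluded from the counts rather than blocked.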