How to prevent unauthorized spidering

刺人心 2021-02-06 04:59

I want to prevent automated HTML scraping from one of our sites while not affecting legitimate spidering (Googlebot, etc.). Is there something that already exists to accomplish this?

6 Answers
  •  长情又很酷
    2021-02-06 05:56

    If you want to protect yourself from a generic crawler, use a honeypot.

    See, for example, http://www.sqlite.org/cvstrac/honeypot. A good spider will not open this page, because the site's robots.txt disallows it explicitly. A human may open it, but is not supposed to click the "I am a spider" link. A bad spider will certainly follow both links and so betray its true identity.
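    A minimal sketch of that idea, assuming a Flask application; the /trap path, the hidden "I am a spider" link, and the in-memory ban set are illustrative choices of mine, not anything prescribed by the sqlite.org page:

        from flask import Flask, abort, request

        app = Flask(__name__)
        banned_ips = set()  # illustrative: production code would use a shared store

        ROBOTS_TXT = "User-agent: *\nDisallow: /trap\n"

        @app.route("/robots.txt")
        def robots():
            # Well-behaved crawlers read this and never fetch /trap.
            return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

        @app.route("/")
        def index():
            # Normal pages carry an invisible link to the trap; humans do not
            # see it, and compliant crawlers skip it because of robots.txt.
            return '<p>content</p><a href="/trap" style="display:none">I am a spider</a>'

        @app.route("/trap")
        def trap():
            # Only a crawler that ignores robots.txt (or follows hidden links)
            # ends up here, so record and block it.
            banned_ips.add(request.remote_addr)
            abort(403)

        @app.before_request
        def block_banned():
            if request.remote_addr in banned_ips:
                abort(403)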

    If the crawler is created specifically for your site, you can (in theory) create a moving honeypot.
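    One way a moving honeypot could look, under the same assumptions; the rotating HMAC-based trap URL (tied to the visitor's IP and the current hour) is my own illustrative design, so a site-specific scraper cannot simply hard-code one path to avoid:

        import hashlib
        import hmac
        import time

        from flask import Flask, abort, request

        app = Flask(__name__)
        SECRET = b"replace-with-a-real-secret"
        banned_ips = set()

        def trap_token(ip: str) -> str:
            # Rotate the trap URL every hour and tie it to the visitor's IP.
            hour = str(int(time.time() // 3600)).encode()
            return hmac.new(SECRET, ip.encode() + hour, hashlib.sha256).hexdigest()[:16]

        @app.route("/page")
        def page():
            # Embed the per-visitor trap link invisibly in normal pages.
            tok = trap_token(request.remote_addr)
            return f'<p>content</p><a href="/t/{tok}" style="display:none">do not follow</a>'

        @app.route("/t/<tok>")
        def trap(tok):
            # Anything fetching its own moving trap URL gets banned.
            if hmac.compare_digest(tok, trap_token(request.remote_addr)):
                banned_ips.add(request.remote_addr)
            abort(403)

        @app.before_request
        def block_banned():
            if request.remote_addr in banned_ips:
                abort(403)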
