How to prevent unauthorized spidering

刺人心 2021-02-06 04:59

I want to prevent automated HTML scraping from one of our sites while not affecting legitimate spidering (Googlebot, etc.). Is there something that already exists to accomplish this?

6 Answers
  •  青春惊慌失措
    2021-02-06 05:55

    You should do what good firewalls do when they detect malicious use: let them keep going, but don't give them anything useful. If you start throwing 403s or 404s, they'll know something is wrong. If you return random data, they'll go about their business.
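
    A minimal sketch of the "keep them going, but feed them junk" idea, assuming a Flask app; FLAGGED_IPS and make_decoy_page() are hypothetical names invented for this example, not part of any existing library:

        import random

        from flask import Flask, request

        app = Flask(__name__)

        FLAGGED_IPS = set()   # filled in by the trap-link handler below
        WORDS = ["alpha", "beta", "gamma", "delta", "omega"]

        def make_decoy_page() -> str:
            # Plausible-looking but worthless HTML.
            rows = "".join(
                f"<li>{random.choice(WORDS)} {random.randint(1, 9999)}</li>"
                for _ in range(20)
            )
            return f"<html><body><ul>{rows}</ul></body></html>"

        @app.before_request
        def serve_decoys_to_flagged_clients():
            # No 403/404 that would tip the scraper off: flagged clients
            # get a 200 with random data and go about their business.
            if request.remote_addr in FLAGGED_IPS:
                return make_decoy_page(), 200
            return None   # everyone else falls through to the real view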

    For detecting malicious use, try adding a trap link on your search results page (or whatever page they are using as a site map) and hiding it with CSS. You'll need to check whether a visitor claims to be a valid bot and let genuine ones through, though. You can also store the offending IP for future use and a quick ARIN WHOIS lookup.
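
    A sketch of the trap link and the bot check, continuing the Flask example above; the /no-follow-trap URL and is_verified_googlebot() are made-up names. Verifying a claimed Googlebot with a forward-confirmed reverse DNS lookup is Google's documented method:

        import socket

        # Hidden somewhere only a scraper parsing raw HTML will follow it:
        # <a href="/no-follow-trap" style="display:none" rel="nofollow">x</a>

        def is_verified_googlebot(ip: str) -> bool:
            # Reverse lookup must land in googlebot.com/google.com, and the
            # forward lookup of that hostname must resolve back to the IP.
            try:
                hostname = socket.gethostbyaddr(ip)[0]
                if not hostname.endswith((".googlebot.com", ".google.com")):
                    return False
                return socket.gethostbyname(hostname) == ip
            except OSError:
                return False

        @app.route("/no-follow-trap")
        def trap():
            ip = request.remote_addr
            ua = request.headers.get("User-Agent", "")
            # Well-behaved crawlers honoring rel="nofollow" shouldn't land
            # here; anyone claiming to be Googlebot gets double-checked.
            if "Googlebot" in ua and is_verified_googlebot(ip):
                return "", 204
            FLAGGED_IPS.add(ip)   # keep the IP for later (and a WHOIS lookup)
            return make_decoy_page(), 200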
