How to prevent unauthorized spidering

Backend · open · 6 answers · 1138 views
刺人心 · asked 2021-02-06 04:59
I want to prevent automated HTML scraping from one of our sites while not affecting legitimate spidering (Googlebot, etc.). Is there something that already exists to accomplish this?

6 Answers
  •  -上瘾入骨i · answered 2021-02-06 05:33

    I agree with the honeypot approach in general. However, I put the ONLY link to the honeypot page/resource on a page that is itself blocked by /robots.txt, and block the honeypot the same way. That way a malicious robot has to violate the Disallow rule(s) TWICE to get itself banned. A typical user manually following an unclickable link would likely do this only once, and may never even find the page containing the honeypot URL.
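    For concreteness, a minimal sketch of that layout (the /bait/ and /secret-admin/ paths are invented for illustration): robots.txt disallows both the page that carries the hidden link and the honeypot itself, so a well-behaved crawler never fetches either.

        User-agent: *
        Disallow: /bait/
        Disallow: /secret-admin/

    The page at /bait/ then contains the only reference to the honeypot, rendered so that a human visitor never clicks it:

        <a href="/secret-admin/" style="display:none" rel="nofollow">admin stats</a>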

    The honeypot resource logs the offending client's IP address into a file, and that file is used as an IP ban list elsewhere in the web server configuration. Once listed, the web server blocks all further access from that IP address until the list is cleared. Others may prefer some sort of automatic expiration, but I only remove entries from the ban list manually.
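    A minimal sketch of the logging-and-banning side, assuming a Flask application (the answer above does this in web server configuration rather than application code, and the /secret-admin/ path and banned_ips.txt file are hypothetical): the honeypot handler appends the client IP to a plain-text ban file, and a before_request hook rejects every later request from a listed address until the file is cleared by hand.

        from pathlib import Path

        from flask import Flask, abort, request

        app = Flask(__name__)
        BAN_FILE = Path("banned_ips.txt")  # hypothetical location; cleared manually


        def banned_ips() -> set[str]:
            """Read the ban list; a missing file means nobody is banned yet."""
            if not BAN_FILE.exists():
                return set()
            return {line.strip() for line in BAN_FILE.read_text().splitlines() if line.strip()}


        @app.before_request
        def block_banned_clients():
            # Reject every request from an IP that has already hit the honeypot.
            if request.remote_addr in banned_ips():
                abort(403)


        @app.route("/secret-admin/")
        def honeypot():
            # Only a client that ignored robots.txt and followed the hidden link
            # ever reaches this handler, so record its IP and ban it from now on.
            with BAN_FILE.open("a") as f:
                f.write(request.remote_addr + "\n")
            abort(403)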

    Aside: I do the same thing for spam on my mail server: sites that send me spam as their first message are banned from sending any further messages until I clear the log file. Although I implement these ban lists at the application level, I also have firewall-level dynamic ban lists, and my mail and web servers share banned-IP information between them. My reasoning was that, for an unsophisticated operator, the same IP address may well host both a malicious spider and a spam spewer. That was before botnets, of course, but I never removed it.
