Detecting 'stealth' web-crawlers

小鲜肉 2020-11-28 00:15

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

11 answers
  • 2020-11-28 00:55

    One simple bot-detection method I've heard of for forms is the hidden-input technique. If you are trying to secure a form, put an input in the form with an id that looks completely legitimate, then use CSS in an external file to hide it. Or, if you are really paranoid, set up something like jQuery to hide the input box on page load. If you do this right, I imagine it would be very hard for a bot to figure out. You know those bots have it in their nature to fill out everything on a page, especially if you give your hidden input an id like id="fname".
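    A minimal server-side sketch of that idea, assuming a Flask app; the route, the form markup, and the decoy field name "fname" are placeholders, and the decoy is assumed to be hidden by an external stylesheet or by script on page load:

        # Honeypot sketch: render a form with a decoy field real users never see,
        # then reject any submission that fills it in. Flask app; "fname" and the
        # /contact route are illustrative only.
        from flask import Flask, request

        app = Flask(__name__)

        FORM = """
        <form method="post" action="/contact">
          <input type="text" name="email">
          <!-- decoy: hidden via external CSS, e.g. #fname { display: none; } -->
          <input type="text" name="fname" id="fname" autocomplete="off">
          <button type="submit">Send</button>
        </form>
        """

        @app.route("/contact", methods=["GET", "POST"])
        def contact():
            if request.method == "POST":
                # A human never sees the decoy, so any value here suggests a bot
                if request.form.get("fname"):
                    return "Rejected", 403
                return "Thanks!", 200
            return FORM

    The rejection happens server-side, so the form-filler never learns which field tripped it.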

  • 2020-11-28 00:56

    It's not actually that easy to keep up with the good user-agent strings; browser versions come and go. But compiling statistics on user-agent strings, broken down by behaviour, can reveal interesting things.

    I don't know how far this could be automated, but at least it is one differentiating factor.
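    As a rough illustration of that kind of statistic, here is a sketch that tallies behaviour per user-agent string from an access log; it assumes the common Apache/Nginx "combined" log format, and the log path and the particular counters (request volume, missing referrers, robots.txt fetches) are just examples:

        # Sketch: per-user-agent behaviour counts from a combined-format access log.
        import re
        from collections import defaultdict

        LINE = re.compile(
            r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d+ \S+ '
            r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
        )

        stats = defaultdict(lambda: {"requests": 0, "no_referer": 0, "robots_txt": 0})

        with open("access.log") as log:          # path is hypothetical
            for line in log:
                m = LINE.search(line)
                if not m:
                    continue
                s = stats[m.group("ua")]
                s["requests"] += 1
                if m.group("referer") in ("", "-"):
                    s["no_referer"] += 1
                if m.group("path") == "/robots.txt":
                    s["robots_txt"] += 1

        # Heavy traffic with no referrers (and no robots.txt fetch) stands out quickly.
        for ua, s in sorted(stats.items(), key=lambda kv: -kv[1]["requests"])[:20]:
            print(s["requests"], s["no_referer"], s["robots_txt"], ua[:60])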

  • 2020-11-28 00:59

    Short answer: if a mid-level programmer knows what he's doing, you won't be able to detect a crawler without also affecting real users. If your information is publicly available, you won't be able to defend it against a crawler... it's like a First Amendment right :)

  • 2020-11-28 01:00

    You can also check referrers. No referrer could raise bot suspicion, although on its own it proves little. A referrer that doesn't correspond to any page on your site almost certainly means it is not a normal browser.
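    A minimal sketch of such a referrer check, assuming a Flask app; the host names are placeholders, and a missing referrer is treated only as a weak signal, since direct visits and privacy tools also strip it:

        # Sketch: score requests by their Referer header. A missing referrer is
        # only suspicious; a referrer that isn't one of your own pages (on deep
        # content, over and over) is a stronger sign of a non-browser client.
        from urllib.parse import urlparse
        from flask import Flask, request

        app = Flask(__name__)
        OWN_HOSTS = {"example.com", "www.example.com"}   # placeholder hosts

        @app.before_request
        def check_referer():
            ref = request.headers.get("Referer", "")
            if not ref:
                app.logger.info("no referrer: %s %s", request.remote_addr, request.path)
                return
            host = urlparse(ref).netloc.split(":")[0]
            if host not in OWN_HOSTS:
                # Normal for landing pages; a stream of deep-page hits with
                # bogus referrers is what to look for.
                app.logger.info("external referrer %s for %s", host, request.path)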

    Adding invisible links (possibly marked as rel="nofollow"?),

    * style="display: none;" on link or parent container
    * placed underneath another element with higher z-index
    

    I wouldn't do that, though. You can end up blacklisted by Google for black-hat SEO :)

  • I currently work for a company that scans web sites in order to classify them. We also check sites for malware.

    In my experience, the number-one blockers of our web crawler (which of course uses an IE or Firefox UA and does not obey robots.txt. Duh.) are sites intentionally hosting malware. It's a pain, because the site then falls back to a human who has to manually load it, classify it and check it for malware.

    I'm just saying, by blocking web crawlers you're putting yourself in some bad company.

    Of course, if they are horribly rude and suck up tons of your bandwidth, it's a different story, because then you've got a good reason.

  • 2020-11-28 01:01

    One thing you didn't list that is commonly used to detect bad crawlers: hit speed.

    Good web crawlers will break their hits up so they don't deluge a site with requests. Bad ones will do one of three things:

    1. hit sequential links one after the other
    2. hit sequential links in some parallel sequence (2 or more at a time)
    3. hit sequential links at a fixed interval

    Also, some offline-browsing programs will slurp up a number of pages; I'm not sure what kind of threshold you'd want to use before you start blocking by IP address.

    This method will also catch mirroring programs like fmirror or wget.

    If the bot randomizes the time interval, you could check whether the links are traversed in a sequential or depth-first manner, or whether the bot is traversing a huge amount of text (as in words to read) in too short a period of time. Some sites also limit the number of requests per hour.

    Actually, I heard an idea somewhere (I don't remember where) that if a user gets too much data, in terms of kilobytes, they can be presented with a captcha asking them to prove they aren't a bot. I've never seen that implemented, though.
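    A rough sketch of the rate and fixed-interval checks described above, keeping recent per-IP timestamps in memory; the window size and thresholds are invented for illustration, not tuned values:

        # Sketch: flag clients that request too fast, or at suspiciously regular
        # intervals (a crawler sleeping a fixed delay has near-zero timing jitter).
        import time
        from collections import defaultdict, deque
        from statistics import pstdev

        WINDOW = 50            # recent timestamps kept per IP
        MAX_PER_MINUTE = 120   # hard rate ceiling (illustrative)
        MIN_JITTER = 0.05      # std-dev of gaps (seconds) below which timing looks scripted

        hits = defaultdict(lambda: deque(maxlen=WINDOW))

        def record_hit(ip, now=None):
            """Record a request; return True if the client looks like a bad bot."""
            now = time.time() if now is None else now
            q = hits[ip]
            q.append(now)
            if sum(1 for t in q if now - t <= 60) > MAX_PER_MINUTE:
                return True
            if len(q) >= 10:
                gaps = [b - a for a, b in zip(q, list(q)[1:])]
                # Humans are bursty; a bot hitting on a fixed interval is not
                if pstdev(gaps) < MIN_JITTER:
                    return True
            return False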

    Update on Hiding Links

    As far as hiding links goes, you can put a div under another with CSS (placing it first in the draw order) and possibly set the z-index. A bot could not ignore that without parsing all your JavaScript to see whether it is a menu. To some extent, links inside invisible DIV elements also can't be ignored without the bot parsing all the JavaScript.

    Taking that idea to completion, uncalled JavaScript which could potentially show the hidden elements would possibly fool a subset of JavaScript-parsing bots. And it is not a lot of work to implement.
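    A small sketch of how the trap side of that could look on the server, assuming a Flask app; the /trap-2f9a URL, the markup, and the in-memory flag set are hypothetical placeholders:

        # Sketch: a link hidden underneath another element points at a trap URL.
        # Humans never see or click it; anything that requests it gets flagged.
        from flask import Flask, request, abort

        app = Flask(__name__)
        flagged_ips = set()

        TRAP_SNIPPET = """
        <div style="position: relative;">
          <div style="position: absolute; z-index: 1;"><a href="/trap-2f9a">specials</a></div>
          <div style="position: absolute; z-index: 2; background: #fff;">Today's specials</div>
        </div>
        """

        @app.route("/")
        def index():
            # In a real site the snippet would sit inside a normal page template
            return "<html><body>...content..." + TRAP_SNIPPET + "</body></html>"

        @app.route("/trap-2f9a")
        def trap():
            flagged_ips.add(request.remote_addr)
            abort(404)   # look like a dead link to whatever followed it

        @app.before_request
        def block_flagged():
            if request.remote_addr in flagged_ips and request.path != "/trap-2f9a":
                abort(403)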
