Detecting 'stealth' web-crawlers

小鲜肉 2020-11-28 00:15

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

11 Answers
  • 2020-11-28 00:55

    One simple bot detection method I've heard of for forms is the hidden-input technique. If you are trying to secure a form, put an input in the form with an id that looks completely legitimate, then use CSS in an external file to hide it. Or, if you are really paranoid, set up something like jQuery to hide the input box on page load. If you do this right, I imagine it would be very hard for a bot to figure out. Bots have it in their nature to fill out everything on a page, especially if you give your hidden input an id like id="fname".
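
    A minimal sketch of that idea, using Flask purely for illustration (the framework, the /signup route, and the field/class names are my own assumptions, not part of the answer):

        from flask import Flask, request, abort

        app = Flask(__name__)

        # The "fname" field is hidden from humans via external CSS (or jQuery
        # on page load), so a real visitor never fills it in; many
        # form-filling bots will.
        FORM = """
        <form method="post" action="/signup">
          <input type="text" name="email">
          <input type="text" name="fname" class="hp">  <!-- hidden via CSS: .hp { display: none; } -->
          <button type="submit">Sign up</button>
        </form>
        """

        @app.route("/signup", methods=["GET", "POST"])
        def signup():
            if request.method == "POST":
                if request.form.get("fname"):  # honeypot field was filled in
                    abort(403)                 # or silently accept and discard
                # ... handle the legitimate submission ...
                return "ok"
            return FORM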

  • 2020-11-28 00:56

    It's not actually that easy to keep up with the good user-agent strings; browser versions come and go. But building statistics on user-agent strings, broken down by behaviour, can reveal interesting things.

    I don't know how far this could be automated, but at least it is one differentiating thing.
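
    As a rough illustration of what such a statistic could look like (a hypothetical sketch over a combined-format access log; the log path, format, and thresholds are assumptions):

        import re
        from collections import defaultdict

        # Count requests and distinct paths per user-agent string from an
        # Apache/Nginx "combined" format access log.
        LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

        stats = defaultdict(lambda: {"hits": 0, "paths": set()})
        with open("access.log") as f:
            for line in f:
                m = LINE.search(line)
                if m:
                    s = stats[m.group("ua")]
                    s["hits"] += 1
                    s["paths"].add(m.group("path"))

        # A UA that claims to be a mainstream browser but walks hundreds of
        # distinct paths stands out against real browser traffic.
        for ua, s in sorted(stats.items(), key=lambda kv: -kv[1]["hits"])[:20]:
            print(s["hits"], len(s["paths"]), ua[:80])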

  • 2020-11-28 00:59

    Short answer: if a mid-level programmer knows what he's doing, you won't be able to detect a crawler without affecting real users. If your information is public, you won't be able to defend it against a crawler; it's like a First Amendment right :)

  • 2020-11-28 01:00

    You can also check referrers. No referrer at all could raise bot suspicion; a malformed referrer almost certainly means the client is not a browser.
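
    For example (a hypothetical Flask before_request hook; the route names, paths, and scores are assumptions):

        from flask import Flask, request

        app = Flask(__name__)
        suspicion = {}   # ip -> score, in-memory for illustration only

        @app.before_request
        def check_referrer():
            ref = request.headers.get("Referer", "")
            ip = request.remote_addr
            # Landing pages are often reached with no referrer (bookmarks,
            # typed URLs), so only score deep pages.
            if request.path.startswith("/articles/"):
                if not ref:
                    suspicion[ip] = suspicion.get(ip, 0) + 1    # mild signal
                elif not ref.startswith(("http://", "https://")):
                    suspicion[ip] = suspicion.get(ip, 0) + 10   # malformed: strong signal

        @app.route("/articles/<slug>")
        def article(slug):
            return f"article {slug}"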

    Adding invisible links (possibly marked as rel="nofollow"?):

    * style="display: none;" on link or parent container
    * placed underneath another element with higher z-index
    

    I wouldn't do that myself, though: you can end up blacklisted by Google for black-hat SEO :) (A sketch of the server side of such a trap follows.)
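
    With that caveat noted, a minimal sketch of what the server side of such a trap could look like (Flask used purely for illustration; the /dont-follow-me path and the in-memory blocklist are my own assumptions):

        from flask import Flask, request, abort

        app = Flask(__name__)
        flagged_ips = set()   # in-memory for illustration; use a shared store in practice

        # Pages embed a link to /dont-follow-me that is invisible to humans
        # (display: none, or buried under another element) and marked
        # rel="nofollow", so real users and well-behaved crawlers never hit it.
        @app.route("/dont-follow-me")
        def trap():
            flagged_ips.add(request.remote_addr)
            return "", 204

        @app.before_request
        def block_flagged():
            if request.remote_addr in flagged_ips:
                abort(403)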

  • I currently work for a company that scans web sites in order to classify them. We also check sites for malware.

    In my experience, the number one blockers of our web crawler (which of course uses an IE or Firefox UA and does not obey robots.txt. Duh.) are sites intentionally hosting malware. It's a pain because the site then falls back to a human who has to manually load it, classify it, and check it for malware.

    I'm just saying, by blocking web crawlers you're putting yourself in some bad company.

    Of course, if they are horribly rude and suck up tons of your bandwidth it's a different story because then you've got a good reason.

  • 2020-11-28 01:01

    One thing you didn't list that is commonly used to detect bad crawlers: hit speed.

    Good web crawlers spread their hits out so they don't deluge a site with requests. Bad ones will do one of three things:

    1. hit sequential links one after the other
    2. hit sequential links in parallel (two or more at a time)
    3. hit sequential links at a fixed interval

    Also, some offline browsing programs will slurp up a number of pages; I'm not sure what threshold you'd want to use before you start blocking by IP address.

    This method will also catch mirroring programs like fmirror or wget.

    If the bot randomizes the time interval, you could check whether the links are traversed in a sequential or depth-first manner, or whether the bot is traversing a huge amount of text (in terms of words to read) in too short a period of time. Some sites also limit the number of requests per hour.

    Actually, I heard an idea somewhere (I don't remember where) that if a user receives too much data, in terms of kilobytes, they can be presented with a CAPTCHA asking them to prove they aren't a bot. I've never seen that implemented, though.
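
    A rough sketch of how both the rate check and the fixed-interval check might look (the window size, rate limit, and variance threshold are arbitrary assumptions):

        import time
        from collections import defaultdict, deque
        from statistics import pstdev

        WINDOW = 30          # keep the last N request timestamps per IP (assumption)
        MAX_PER_MINUTE = 60  # crude rate limit (assumption)

        history = defaultdict(lambda: deque(maxlen=WINDOW))

        def record_and_score(ip, now=None):
            """Return True if this request looks automated."""
            now = time.time() if now is None else now
            hist = history[ip]
            hist.append(now)
            if len(hist) < 5:
                return False
            recent = [t for t in hist if now - t < 60]
            if len(recent) > MAX_PER_MINUTE:
                return True                      # too fast for a human
            gaps = [b - a for a, b in zip(hist, list(hist)[1:])]
            # A human's gaps vary a lot; a bot hitting links at a fixed
            # interval produces near-identical gaps (very low deviation).
            return pstdev(gaps) < 0.05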

    Update on Hiding Links

    As far as hiding links goes, you can put a div underneath another with CSS (placing it first in the draw order) and possibly set the z-order. A bot could not ignore that without parsing all your JavaScript to see whether it is a menu. To some extent, links inside invisible DIV elements also can't be ignored without the bot parsing all the JavaScript.

    Taking that idea to completion, uncalled JavaScript which could potentially show the hidden elements would possibly fool a subset of JavaScript-parsing bots. And it is not a lot of work to implement.
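
    The markup side of that might look something like this (a sketch only; the trap URL, sizes, and the never-called reveal() function are my own assumptions):

        # Trap link buried under an opaque element via z-index, plus an
        # uncalled JavaScript function that could, in theory, reveal it.
        # A bot would have to evaluate the CSS stacking and notice that
        # reveal() is never invoked to know the link is invisible.
        TRAP_MARKUP = """
        <div style="position: relative;">
          <div style="position: absolute; top: 0; left: 0; z-index: 1;">
            <a href="/dont-follow-me" rel="nofollow">special offers</a>
          </div>
          <div style="position: absolute; top: 0; left: 0; z-index: 2;
                      width: 200px; height: 20px; background: #fff;"></div>
        </div>
        <script>
          function reveal() {            /* never called */
            document.querySelector('a[href="/dont-follow-me"]')
                    .parentNode.style.zIndex = 3;
          }
        </script>
        """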
