Situation:
Point 1 has the problem you mentioned yourself. It also doesn't help against a slower crawl of the site; or if it does, it may be even worse for legitimate heavy users.
You could turn point 2 around and only allow the user-agents you trust. Of course this won't help against a tool that fakes a standard user-agent.
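That inverted check could be sketched roughly like this (the class name and the list of trusted prefixes are my own, purely illustrative; a real allowlist would need tuning for the clients you actually see):

```java
import java.util.List;

public class UserAgentPolicy {
    // Illustrative allowlist of user-agent prefixes we trust.
    private static final List<String> TRUSTED_PREFIXES = List.of(
            "Mozilla/", "Opera/");

    // Returns true if the user agent starts with one of the trusted prefixes.
    // Missing or empty user agents are treated as untrusted.
    public static boolean isTrusted(String userAgent) {
        if (userAgent == null || userAgent.isEmpty()) {
            return false;
        }
        for (String prefix : TRUSTED_PREFIXES) {
            if (userAgent.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

And as noted, a scraper that sends a browser-like user agent sails straight through this, so it is only a first filter.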
A variation on point 3 would be to just send a notification to the site owners, who can then decide what to do with that user.
Similarly, for my variation on point 2, you could make this a softer action and just notify the owners that somebody is accessing the site with a weird user agent.
edit: Related, I once had a weird issue when I was accessing a URL of my own that was not public (I was just staging a site that I hadn't announced or linked anywhere). Although nobody but me should even have known this URL, all of a sudden I noticed hits in the logs. When I tracked this down, I saw they were from some content-filtering site. It turned out that my mobile ISP used a third party to block content, and it intercepted my own requests: since it didn't know the site, it fetched the page I was trying to access and (I assume) did some keyword analysis in order to decide whether or not to block it. This kind of thing might be an edge case you need to watch out for.
@Greg Hewgill:
Randomly changing the logout/disable URL for point 3 would be interesting, but I don't know how I would implement it yet :)
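One way this might work is to issue each session its own random trap/logout path, so a scraper can't just learn the URL once and skip it. This is only a sketch under that assumption (class and method names are invented for illustration):

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class LogoutTokens {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Maps session id -> the random path segment that triggers logout/disable.
    private final Map<String, String> tokens = new HashMap<>();

    // Issue a fresh random logout path for this session, e.g. "/logout/3f9a...".
    public String issue(String sessionId) {
        byte[] buf = new byte[16];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder("/logout/");
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        String path = sb.toString();
        tokens.put(sessionId, path);
        return path;
    }

    // A request only counts as this session's trap URL if the path matches
    // the token issued to that same session.
    public boolean matches(String sessionId, String requestPath) {
        return requestPath != null && requestPath.equals(tokens.get(sessionId));
    }
}
```

The issued path would then be embedded as a hidden link in each rendered page; only a crawler following every link blindly should ever hit it.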
http://recaptcha.net
Show it either every time someone logs in or while signing up. Or maybe you could show a captcha only every tenth time.
Added comments:
Again, from a service provider's point of view, I'm also interested in making sure that one user (a crawler) doesn't consume CPU/bandwidth at the expense of others, so are there any good bandwidth/request limiters you can point out?
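A common building block for this is a per-client token bucket: each client may burst up to a fixed capacity of requests and then refills at a steady rate. A minimal sketch (my own illustrative class, not a specific library; the time is passed in explicitly to keep it testable):

```java
import java.util.HashMap;
import java.util.Map;

// Per-client token bucket: each client may burst up to `capacity` requests,
// then refills at `refillPerSecond` tokens per second.
public class TokenBucketLimiter {
    private final double capacity;
    private final double refillPerSecond;

    private static final class Bucket {
        double tokens;
        long lastNanos;
    }

    private final Map<String, Bucket> buckets = new HashMap<>();

    public TokenBucketLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    // Returns true if the request from `clientId` (e.g. an IP address) is
    // allowed, consuming one token; false means the client should be throttled.
    public synchronized boolean allow(String clientId, long nowNanos) {
        Bucket b = buckets.computeIfAbsent(clientId, id -> {
            Bucket fresh = new Bucket();
            fresh.tokens = capacity;
            fresh.lastNanos = nowNanos;
            return fresh;
        });
        double elapsedSeconds = (nowNanos - b.lastNanos) / 1e9;
        b.tokens = Math.min(capacity, b.tokens + elapsedSeconds * refillPerSecond);
        b.lastNanos = nowNanos;
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production you would call `allow(ip, System.nanoTime())` from a servlet filter, and you may prefer to do the limiting in the Apache layer in front of JBoss instead of in the application.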
response to comment: Platform specifications: the application is based on JBoss Seam running on JBoss AS, with an Apache 2 server in front of it (running on Linux).
It depends on what kind of malicious user we are talking about.
If they know how to use wget, they can probably set up Tor and get a new IP address every time, slowly copying everything you have. I don't think you can prevent that without inconveniencing your (paying?) users.
It is the same as with DRM on games, music, and video: if the end user is supposed to see something, you cannot protect it.
The problem with option 3 is that the auto-logout would be trivial to avoid once the scraper figures out what is going on.