How to protect/monitor your site from crawling by a malicious user


Situation:

  • Site with content protected by username/password (not all controlled since they can be trial/test users)
  • a normal search engine can't get at i
9 answers
  • 2021-02-06 19:13

    Point 1 has the problem you have mentioned yourself. Also it doesn't help against a slower crawl of the site, or if it does then it may be even worse for legitimate heavy users.

    You could turn point 2 around and only allow the user-agents you trust. Of course this won't help against a tool that fakes a standard user-agent.

    A variation on point 3 would just be to send a notification to the site owners, then they can decide what to do with that user.

    Similarly for my variation on point 2, you could make this a softer action, and just notify that somebody is accessing the site with a weird user agent.
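
The softer variation on point 2 could be sketched roughly as follows. This is only an illustration: the pattern list, `classify_user_agent`, `handle_request`, and the `notify` callback are hypothetical names, and a tool that fakes a standard user-agent will still pass this check.

```python
import re

# Hypothetical allowlist: patterns for user agents the site chooses to trust.
TRUSTED_UA_PATTERNS = [
    re.compile(r"Mozilla/\d\.\d .*(Firefox|Chrome|Safari|MSIE)"),
]


def classify_user_agent(user_agent):
    """Return "trusted" or "suspicious" for a User-Agent header value."""
    if user_agent and any(p.search(user_agent) for p in TRUSTED_UA_PATTERNS):
        return "trusted"
    return "suspicious"


def handle_request(user, user_agent, notify):
    """Soft action: never block, only report unusual agents to the site owners."""
    verdict = classify_user_agent(user_agent)
    if verdict == "suspicious":
        # notify is any callable taking a message, e.g. a mailer or logger.
        notify(f"user {user!r} used unusual user agent {user_agent!r}")
    return verdict
```

Because the request is still served, a legitimate user with an odd browser is not locked out; the owners just get a message and can decide what to do.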

    edit: Related, I once had a weird issue when I was accessing a URL of my own that was not public (I was just staging a site that I hadn't announced or linked anywhere). Although nobody but me should even have known this URL, all of a sudden I noticed hits in the logs. When I tracked this down, I saw they came from some content filtering site. It turned out that my mobile ISP used a third party to block content, and it intercepted my own requests: since it didn't know the site, it fetched the page I was trying to access and (I assume) did some keyword analysis in order to decide whether or not to block it. This kind of thing might be an edge case you need to watch out for.

  • 2021-02-06 19:18

    @frankodwyer:

    • Allowing only trusted user agents won't work. Consider especially the IE user-agent string, which gets modified by add-ons or the .NET version. There would be too many possibilities, and it can be faked.
    • The variation on point 3 with a notification to the admin would probably work, but it would mean an indeterminate delay if an admin isn't monitoring the logs constantly.

    @Greg Hewgill:

    • The auto-logout would also disable the user account. At the very least, a new account would have to be created, leaving more of a trail, such as an email address and other information.

    Randomly changing the logout/disable URL for point 3 would be interesting, but I don't know how I would implement it yet :)
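
One possible way to get a changing logout/disable URL, assuming point 3 refers to a trap link that normal users never follow, is to derive the URL from the session with a keyed hash, so each session sees a different one and a scraper cannot learn a fixed path to skip. A minimal sketch; `SERVER_SECRET`, `trap_url`, `is_trap_hit`, and the `/page/` prefix are all made up for illustration:

```python
import hashlib
import hmac
import secrets

# Assumption: a private key generated at startup and never sent to clients.
SERVER_SECRET = secrets.token_bytes(32)


def trap_token(session_id):
    """Derive a per-session token that cannot be guessed without the secret."""
    mac = hmac.new(SERVER_SECRET, session_id.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:16]


def trap_url(session_id):
    """URL to embed as the hidden trap link in pages served to this session."""
    return f"/page/{trap_token(session_id)}"


def is_trap_hit(session_id, path):
    """True if the requested path is this session's trap URL."""
    expected = trap_url(session_id)
    return hmac.compare_digest(expected.encode("utf-8"), path.encode("utf-8"))
```

Nothing has to be stored per session, since the URL can be recomputed from the session id on every request. What happens on a hit (logout, disabling the account, or just a notification) is a separate decision.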

  • 2021-02-06 19:18

    http://recaptcha.net

    Show it either every time someone logs in or during sign-up. Maybe you could show a captcha only every tenth time.
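
The "every tenth time" idea could look something like the sketch below. `CaptchaPolicy` is a hypothetical helper, the counter is kept in memory only for illustration, and rendering and verifying the captcha itself is left to whatever captcha service is used.

```python
from collections import defaultdict


class CaptchaPolicy:
    """Decide when a login should be interrupted by a captcha challenge."""

    def __init__(self, interval=10):
        self.interval = interval
        self.logins = defaultdict(int)  # username -> number of logins seen

    def on_login(self, username):
        """Count this login; return True if a captcha should be shown now."""
        self.logins[username] += 1
        return self.logins[username] % self.interval == 0
```

A real deployment would keep the counter with the user record so it survives restarts.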

  • 2021-02-06 19:21

    Added comments:

    • I know you can't completely protect something that a normal user should be able to see. I've been on both sides of the problem :)
    • From a developer's side, what do you think is the best ratio of time spent versus cases protected? I'd guess some simple user-agent checks would remove half or more of the potential crawlers, and I know you can spend months of development protecting against the last 1%.

    Again, from a service provider's point of view, I'm also interested in making sure that one user (crawler) doesn't consume CPU/bandwidth at the expense of others. Are there any good bandwidth/request limiters you can point out?

    Response to comment, platform specifications: the application is based on JBoss Seam running on JBoss AS, with an apache2 in front of it (running on Linux).
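
As a rough illustration of what a per-user request limiter does, here is a token-bucket sketch. It is written in Python purely for brevity and is not tied to JBoss or Apache; `TokenBucket`, `allow_request`, and the default rate and capacity are assumptions chosen for the example.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Consume one token if available; return whether the request may pass."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets = {}  # user name -> TokenBucket


def allow_request(user, now=None, rate=1.0, capacity=5):
    """Look up (or create) the user's bucket and check one request against it."""
    bucket = buckets.get(user)
    if bucket is None:
        bucket = buckets[user] = TokenBucket(rate, capacity, now)
    return bucket.allow(now)
```

Each user gets a separate bucket, so one crawler exhausting its allowance does not affect anyone else. The `now` parameter exists only to make the behavior easy to test with fixed timestamps.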

  • 2021-02-06 19:23

    It depends on what kind of malicious user we are talking about.

    If they know how to use wget, they can probably set up Tor and get a new IP every time, slowly copying everything you have. I don't think you can prevent that without inconveniencing your (paying?) users.

    It is the same as DRM on games, music, and video. If the end user is supposed to see something, you cannot protect it.

  • 2021-02-06 19:25

    The problem with option 3 is that the auto-logout would be trivial to avoid once the scraper figures out what is going on.
