Prevent site data from being crawled and ripped

终归单人心 2020-12-15 06:32

I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search.

What are the measures I can take to prevent malicious crawlers from ripping off all the data?

12 Answers
  • 2020-12-15 06:33

    Realistically, you can't stop malicious crawlers, and any measures you put in place to block them are likely to harm your legitimate users (aside from perhaps adding entries to robots.txt so you can detect offenders).

    So what you have to do is to plan on the content being stolen - it's more than likely to happen in one form or another - and understand how you will deal with unauthorized copying.

    Prevention isn't possible, and trying to make it so will be a waste of your time.

    The only sure way of making sure that the content on a website isn't vulnerable to copying is to unplug the network cable...

    To detect copying, something like http://www.copyscape.com/ may help.

  • 2020-12-15 06:36

    Good crawlers will follow the rules you specify in your robots.txt; malicious ones will not. You can set up a "trap" for bad robots, as explained here: http://www.fleiner.com/bots/ (a rough sketch of the idea is below).
    But then again, if you put your content on the internet, I think it's better for everyone if it's as painless as possible to find (in fact, you're posting here and not at some lame forum where experts exchange their opinions).
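
    To make the trap idea concrete, here is a minimal sketch assuming a Flask app; the /bot-trap/ URL, the in-memory block list, and the hidden link are all illustrative placeholders, not a production setup:

    ```python
    # Minimal "bot trap" sketch, assuming a Flask app.
    # /bot-trap/ is disallowed in robots.txt and never linked visibly, so polite
    # crawlers and humans never hit it; anything that does is treated as a rogue
    # bot and its IP is blocked. All names here are illustrative.
    from flask import Flask, abort, request

    app = Flask(__name__)
    BLOCKED_IPS = set()  # a real setup would persist this (database, firewall, fail2ban...)

    @app.route("/robots.txt")
    def robots():
        # Well-behaved crawlers read this and skip the trap; rogue ones ignore it.
        return "User-agent: *\nDisallow: /bot-trap/\n", 200, {"Content-Type": "text/plain"}

    @app.route("/bot-trap/")
    def bot_trap():
        # Only something ignoring robots.txt (or following a hidden link) lands here.
        BLOCKED_IPS.add(request.remote_addr)
        abort(403)

    @app.before_request
    def refuse_blocked():
        if request.remote_addr in BLOCKED_IPS:
            abort(403)

    @app.route("/")
    def index():
        # A hidden link to the trap can be embedded in real pages.
        return '<a href="/bot-trap/" style="display:none">do not follow</a>Hello'
    ```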

  • 2020-12-15 06:38

    If the content is public and freely available, even with page view throttling or whatever, there is nothing you can do. If you require registration and/or payment to access the data, you might restrict it a bit, and at least you can see who reads what and identify the users that seem to be scraping your entire database.

    However, I think you should rather face the fact that this is how the net works: there are not many ways to prevent a machine from reading what a human can. Rendering all your content as images would of course discourage most scrapers, but then the site is no longer accessible, and even non-disabled users won't be able to copy and paste anything - which can be really annoying.

    All in all, this sounds like DRM/game-protection systems - pissing off your legit users only to prevent some bad behavior that you can't really prevent anyway.

  • 2020-12-15 06:39

    In short: you cannot prevent ripping. Malicious bots commonly use IE user agents and are fairly intelligent nowadays. If you want your site accessible to the maximum number of people (i.e. screen readers, etc.), you cannot rely on JavaScript or one of the popular plugins (Flash), simply because they can inhibit a legitimate user's access.

    Perhaps you could have a cron job that picks a random snippet out of your database and googles it to check for matches. You could then try and get hold of the offending site and demand they take the content down.
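
    A rough sketch of such a cron job, assuming a SQLite table named articles with a body column and Google's Custom Search JSON API; the API key, search engine ID, domain, and table layout are placeholders:

    ```python
    # Sketch of a "search for your own snippets" cron job.
    # Assumes a SQLite table `articles(id, body)` and Google's Custom Search JSON
    # API (needs an API key and a search engine ID); all of these are placeholders.
    import random
    import sqlite3
    import requests

    API_KEY = "YOUR_API_KEY"          # placeholder
    SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder
    OWN_DOMAIN = "example.com"        # placeholder: your own site

    def random_snippet(db_path="content.db", words=12):
        # Pull one random article and cut a short phrase out of it.
        conn = sqlite3.connect(db_path)
        row = conn.execute("SELECT body FROM articles ORDER BY RANDOM() LIMIT 1").fetchone()
        conn.close()
        tokens = row[0].split()
        start = random.randint(0, max(0, len(tokens) - words))
        return " ".join(tokens[start:start + words])

    def find_copies(snippet):
        # Exact-phrase search; report any hit that is not on our own domain.
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": f'"{snippet}"'},
            timeout=10,
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        return [item["link"] for item in items if OWN_DOMAIN not in item["link"]]

    if __name__ == "__main__":
        for url in find_copies(random_snippet()):
            print("possible copy:", url)
    ```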

    You could also monitor the number of requests from a given IP and block it if it passes a threshold, although you may have to whitelist legitimate bots, and this would be of no use against a botnet (but if you are up against a botnet, perhaps ripping is not your biggest problem).
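
    A toy version of that throttle, kept in memory for illustration; the window, threshold, and whitelist IP are made-up values, and a real deployment would do this at the reverse proxy or in a shared store:

    ```python
    # Sliding-window throttle: reject an IP that exceeds a request threshold,
    # with a whitelist for crawlers you explicitly trust. Values are illustrative.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 120            # per IP per window; the number is made up
    WHITELIST = {"66.249.66.1"}   # placeholder IP of a crawler you trust

    _recent = defaultdict(deque)  # ip -> timestamps of requests inside the window

    def allow_request(ip: str) -> bool:
        """Return True if the request should be served, False if the IP is over the threshold."""
        if ip in WHITELIST:
            return True
        now = time.time()
        window = _recent[ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()      # forget requests older than the window
        if len(window) >= MAX_REQUESTS:
            return False          # over the threshold: reject, throttle, or serve a captcha
        window.append(now)
        return True
    ```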

  • 2020-12-15 06:40

    Any site that is visible to human eyes is, in theory, potentially rippable. If you're going to even try to be accessible then this, by definition, must be the case (how else will speaking browsers be able to deliver your content if it isn't machine-readable?).

    Your best bet is to look into watermarking your content, so that at least if it does get ripped you can point to the watermarks and claim ownership.
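
    One illustrative way to watermark plain text: hide a per-user ID in zero-width characters sprinkled into the served HTML, so a ripped copy can be traced back to the account that fetched it. The helper names are hypothetical and the scheme will not survive a determined cleaner:

    ```python
    # Encode a per-user ID into invisible zero-width characters and inject it into
    # the served HTML. If the text shows up elsewhere, decoding the hidden bits
    # tells you which account/session leaked it. Illustrative only: it survives
    # copy-paste but not deliberate stripping of unusual characters.
    ZERO = "\u200b"   # zero-width space      -> bit 0
    ONE = "\u200c"    # zero-width non-joiner -> bit 1

    def encode_watermark(user_id: int, bits: int = 32) -> str:
        return "".join(ONE if (user_id >> i) & 1 else ZERO for i in range(bits))

    def decode_watermark(text: str, bits: int = 32) -> int:
        hidden = [c for c in text if c in (ZERO, ONE)][:bits]
        return sum((1 << i) for i, c in enumerate(hidden) if c == ONE)

    def watermark_html(html: str, user_id: int) -> str:
        # Naive placement: drop the hidden marker just before the first closing </p>.
        return html.replace("</p>", encode_watermark(user_id) + "</p>", 1)

    if __name__ == "__main__":
        page = "<p>Some article text.</p><p>More text.</p>"
        stamped = watermark_html(page, user_id=123456)
        assert decode_watermark(stamped) == 123456
    ```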

  • 2020-12-15 06:40

    Don't even try to erect limits on the web!

    It really is as simple as this.

    Every potential measure to discourage ripping (aside from a very strict robots.txt) will harm your users. Captchas are more pain than gain. Checking the user agent shuts out unexpected browsers. The same is true for "clever" tricks with javascript.

    Please keep the web open. If you don't want anything to be taken from your website, then do not publish it there. Watermarks can help you claim ownership, but that only helps when you want to sue after the harm is done.
