Top techniques to avoid 'data scraping' from a website database

Backend · Unresolved · 14 answers · 1799 views

轻奢々 2020-12-25 14:02

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably, my client is very keen to prevent anyone from being able to scrape the data out of that database.

14 Answers
  • 2020-12-25 14:17

    Try using Flash or Silverlight for your frontend.

    While this can't stop someone who is really determined, it would make scraping more difficult. If you're loading your data through services, you can always use a secure connection to prevent man-in-the-middle scraping.

  • 2020-12-25 14:19

    My suggestion would be that scraping your content is likely illegal anyway, so at least you have legal recourse if someone does scrape the website. So maybe the best thing to do would be just to include a link to the original site and let people scrape away. The more they scrape, the more of your links will appear around the Internet, building up your PageRank more and more.

    People who scrape usually aren't opposed to including a link to the original site since it builds a sort of rapport with the original author.

    So my advice is to ask your boss whether this could actually be the best thing possible for the website's health.

  • 2020-12-25 14:20

    While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:

    • Rate limit by user account, IP address, user agent, etc. - this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account or IP address (a minimal sketch follows this list).

    • Require JavaScript - to ensure the client has some semblance of an interactive browser, rather than a barebones spider...

    • RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJs, YUI, Dojo, etc. Richer environments include Flash and Silverlight as 1kevgriff mentions.

    • Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.

    • robots.txt - to deny obvious web spiders, known robot user agents.

      User-agent: *
      Disallow: /

    • Use robots meta tags. This would stop conforming spiders. For instance, this will prevent Google from indexing you:

      <meta name="robots" content="noindex,follow,noarchive">

    There are different levels of deterrence and the first option is probably the least intrusive.
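
    As a rough illustration of the rate-limiting idea above, here is a minimal PHP sketch that counts requests per IP address in a fixed time window, using APCu as the counter store. The 60-request/60-second limits are arbitrary examples, and any shared store (Redis, memcached, a MySQL table) would work just as well:

      <?php
      // Minimal per-IP rate limiter (sketch); assumes the APCu extension.
      $ip     = $_SERVER['REMOTE_ADDR'];
      $limit  = 60;               // max requests allowed...
      $window = 60;               // ...per 60-second window
      $key    = 'hits_' . $ip;

      // Create the counter with a TTL if it doesn't exist yet, then bump it.
      apcu_add($key, 0, $window);
      $hits = apcu_inc($key);

      if ($hits > $limit) {
          header('Retry-After: ' . $window, true, 429);
          exit('Too many requests - slow down.');
      }
      // ...normal page handling continues here...

    Responding with 429 rather than silently dropping the connection also makes it easy to spot and whitelist legitimate heavy users later.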

  • 2020-12-25 14:22

    Normally, to screen-scrape a decent amount of data, one has to make hundreds or thousands (or more) of requests to your server. I suggest you read this related Stack Overflow question:

    How do you stop scripters from slamming your website hundreds of times a second?

  • 2020-12-25 14:26

    There are a few ways you can do it, although none of them are ideal.

    1. Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn't be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all). (A sketch using GD follows this list.)

    2. Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping. (A sketch of this handshake also follows.)
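
    For the first approach, here is a minimal sketch using PHP's GD extension to render a single value as a PNG instead of emitting it as text. The lookup function, dimensions and font are placeholders for whatever your page actually needs:

      <?php
      // value.php - render one data value as an image (sketch, assumes GD).
      $value = fetch_value_from_db($_GET['id']);   // hypothetical DB lookup

      $img = imagecreatetruecolor(200, 30);
      $bg  = imagecolorallocate($img, 255, 255, 255);
      $fg  = imagecolorallocate($img, 0, 0, 0);
      imagefilledrectangle($img, 0, 0, 199, 29, $bg);
      imagestring($img, 4, 5, 8, (string) $value, $fg);   // built-in bitmap font

      header('Content-Type: image/png');
      imagepng($img);
      imagedestroy($img);

    The page would then reference it with something like <img src="value.php?id=...">. A naive text parser only sees the img tag, although OCR can still defeat this.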
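
    For the second approach, a sketch of the session-hash handshake: the page shell stores a short-lived token in the session and prints it into the page, and the AJAX endpoint only answers if the same token comes back within the time limit. The file names and the 10-second window are illustrative:

      <?php
      // page.php (sketch): issue a short-lived token for the AJAX call.
      session_start();
      $_SESSION['ajax_token']   = bin2hex(random_bytes(16));
      $_SESSION['ajax_expires'] = time() + 10;    // valid for ~10 seconds
      $token = $_SESSION['ajax_token'];
      // ...echo the page shell here; client-side script then requests
      //    data.php with that token and inserts the response into the DOM.

    and the endpoint that the AJAX call hits:

      <?php
      // data.php (sketch): only return data when the token matches and is fresh.
      session_start();
      $ok = isset($_SESSION['ajax_token'], $_GET['token'])
         && hash_equals($_SESSION['ajax_token'], $_GET['token'])
         && time() <= $_SESSION['ajax_expires'];

      if (!$ok) {
          http_response_code(403);                // refuse bare scrapers
          exit;
      }
      unset($_SESSION['ajax_token']);             // one-time use
      // ...query MySQL and echo the data fragment here...

    A scraper that just fetches pages with curl never runs the JavaScript, so it never obtains a valid token; a headless browser still will, which is why this raises the bar rather than closing the door.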

  • 2020-12-25 14:27

    Use the fact that scrapers tend to load many pages in quick succession to detect scraping behaviours. Display a CAPTCHA for every n page loads over x seconds, and/or include an exponentially growing delay for each page load that becomes quite long when, say, tens of pages are being loaded each minute.

    This way, normal users will probably never see your CAPTCHA, but scrapers will quickly hit the limit that forces them to solve one. A rough sketch of this escalation follows.
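
    Here is a minimal sketch of that escalation in PHP, keeping the counter in the session. The thresholds and the show_captcha_and_exit() helper are hypothetical placeholders:

      <?php
      // Per-session page-load counter with a CAPTCHA checkpoint and an
      // exponentially growing delay (sketch; tune the numbers to taste).
      session_start();

      $now = time();
      if (!isset($_SESSION['loads']) || $now - $_SESSION['window_start'] > 60) {
          $_SESSION['loads']        = 0;          // reset the counter each minute
          $_SESSION['window_start'] = $now;
      }
      $_SESSION['loads']++;

      // Past a threshold, demand a CAPTCHA before serving anything else.
      if ($_SESSION['loads'] > 50) {
          show_captcha_and_exit();                // hypothetical CAPTCHA page
      }

      // Exponentially growing delay: invisible to humans, painful for scrapers.
      if ($_SESSION['loads'] > 20) {
          $micros = 100000 * 2 ** ($_SESSION['loads'] - 20);
          usleep((int) min(5000000, $micros));    // cap the delay at 5 seconds
      }

    Keying the counter by IP address as well as by session is worth considering, since a scraper can simply discard its session cookie.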
