Top techniques to avoid 'data scraping' from a website database

轻奢々 · 2020-12-25 14:02

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably, my client is very keen to prevent anyone from being able to scrape the data out of it in bulk.

14 Answers
  礼貌的吻别 · 2020-12-25 14:20

    While there's nothing to stop a determined person from scraping publicly available content, you can do a few basic things to mitigate the client's concerns:

    • Rate limit by user account, IP address, user agent, etc. - restrict the amount of data a particular user or group can download in a given period of time, and if you detect a large amount of data being transferred, shut down the account or block the IP address, as sketched below.
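
      A minimal sketch of the per-IP variant in PHP, assuming an existing PDO connection in $db and a hypothetical rate_limits table with a unique key on (ip, window_start):

        <?php
        // Count hits per IP within a fixed time window; block once over the limit.
        // Assumes: CREATE TABLE rate_limits (ip VARCHAR(45), window_start INT,
        //          hits INT, UNIQUE KEY (ip, window_start));
        function allowRequest(PDO $db, string $ip, int $limit = 100, int $windowSecs = 3600): bool
        {
            $windowStart = time() - (time() % $windowSecs); // start of current window

            // Insert a counter row for this IP/window, or bump it if it exists.
            $stmt = $db->prepare(
                'INSERT INTO rate_limits (ip, window_start, hits) VALUES (?, ?, 1)
                 ON DUPLICATE KEY UPDATE hits = hits + 1'
            );
            $stmt->execute([$ip, $windowStart]);

            // Read the counter back and compare against the limit.
            $stmt = $db->prepare('SELECT hits FROM rate_limits WHERE ip = ? AND window_start = ?');
            $stmt->execute([$ip, $windowStart]);
            return (int) $stmt->fetchColumn() <= $limit;
        }

        if (!allowRequest($db, $_SERVER['REMOTE_ADDR'])) {
            http_response_code(429); // Too Many Requests
            exit('Rate limit exceeded.');
        }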

    • Require JavaScript - to ensure the client bears some resemblance to an interactive browser rather than a barebones spider.
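
      A crude PHP version of this check, assuming only cookies: serve a stub page that sets a marker cookie via JavaScript and reloads, so clients that never execute scripts never reach the real content. It is trivially bypassed by setting the cookie manually, so treat it as a speed bump, not a wall:

        <?php
        // Gate the page behind a cookie that only a JS-executing client will set.
        // (Clients with cookies disabled will loop; a real version needs a fallback.)
        if (empty($_COOKIE['js_ok'])) {
            echo '<script>document.cookie = "js_ok=1; path=/"; location.reload();</script>';
            exit;
        }
        // ...render the real page for clients that came back with the cookie...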

    • RIA - make your data available through a Rich Internet Application interface. JavaScript-based grids include ExtJS, YUI, Dojo, etc.; richer environments include Flash and Silverlight, as 1kevgriff mentions.
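
      A sketch of the server side of such a grid, again assuming a PDO connection in $db and a hypothetical records table. The grid fetches this JSON endpoint at runtime, so the data never appears in the static HTML - though the endpoint itself can still be called by a script, so pair it with rate limiting:

        <?php
        // JSON endpoint consumed by a JavaScript grid (ExtJS, YUI, Dojo, ...).
        header('Content-Type: application/json');
        $stmt = $db->query('SELECT id, name, value FROM records LIMIT 50');
        echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));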

    • Encode data as images. This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, but isn't foolproof of course.
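
      A sketch using PHP's bundled GD extension, rendering one made-up value as a PNG so it never appears as machine-readable text in the page source:

        <?php
        // Serve a sensitive value as an image instead of text (requires GD).
        $value = '555-0199'; // e.g. a phone number pulled from the database
        $img = imagecreatetruecolor(120, 20);
        imagefilledrectangle($img, 0, 0, 119, 19, imagecolorallocate($img, 255, 255, 255));
        imagestring($img, 4, 5, 3, $value, imagecolorallocate($img, 0, 0, 0));
        header('Content-Type: image/png');
        imagepng($img);
        imagedestroy($img);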

    • robots.txt - deny obvious web spiders and known robot user agents:

      User-agent: *

      Disallow: /

    • Use robots meta tags. These stop conforming spiders; the following, for instance, will prevent Google from indexing a page:
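
      <meta name="robots" content="noindex, nofollow">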

    There are different levels of deterrence, and the first option is probably the least intrusive.
