Top techniques to avoid 'data scraping' from a website database

轻奢々 2020-12-25 14:02

I am setting up a site using PHP and MySQL that is essentially just a web front-end to an existing database. Understandably my client is very keen to prevent anyone from be

14 answers
  • 2020-12-25 14:28

    There's no easy solution for this. If the data is publicly available, it can be scraped. The most you can do is make life harder for the scraper by making each entry's markup slightly different, adding or changing HTML without affecting the layout. That may make it harder to harvest the data with regular expressions, but it's still not a real solution, and anyone determined enough will find a way around it.
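To make the "slightly different markup" idea concrete, here is a minimal sketch in Python (the question uses PHP, but the logic ports directly). The function name `render_entry`, the random class names and the empty decoy spans are all assumptions made for illustration, not something from the original answer. As the answer says, this only raises the cost of regex-based harvesting; it does not prevent scraping.

```python
import html
import random
import string


def render_entry(fields, rng=None):
    """Render one record as inline HTML whose markup differs on every call.

    `fields` is a list of plain-text values. Hypothetical helper for illustration.
    """
    rng = rng or random.Random()
    alphabet = string.ascii_lowercase + string.digits
    parts = []
    for value in fields:
        # Random class name, so a scraper cannot key a selector or regex on it.
        cls = "c" + "".join(rng.choices(alphabet, k=6))
        parts.append(f'<span class="{cls}">{html.escape(value)}</span>')
        # Sometimes add an empty span: it renders as nothing but shifts the markup.
        if rng.random() < 0.5:
            parts.append("<span></span>")
    return " ".join(parts)


if __name__ == "__main__":
    for _ in range(3):
        print(render_entry(["Alice Example", "555-0100"]))
```

Only unstyled inline spans are used, so the visible layout should stay the same. A scraper that strips tags and reads the text is unaffected, which is exactly the limitation described above.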

    I would suggest telling your client that this is an unachievable task and getting on with the important parts of your work.

  • 2020-12-25 14:33

    Using something like Adobe Flex - a Flash application front end - would fix this.

    Other than that, if you want it to be easy for users to access, it's easy for users to copy.

  • 2020-12-25 14:35

    I don't know why you'd deter this. The customer's offering the data.

    Presumably they create value in some unique way that's not trivially reflected in the data.

    Anyway.

    You can check the browser, screen resolution and IP address to see if it's likely some kind of automated scraper.

    Most things like cURL and wget -- unless carefully configured -- are pretty obviously not browsers.
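A rough sketch of the header check, in Python for illustration. The marker list and the `looks_automated` name are assumptions; the check relies only on the fact that curl, Wget, python-requests and Scrapy identify themselves in their default `User-Agent`. Screen resolution is not covered here because it needs client-side script to collect.

```python
# Substrings found in the default User-Agent of common command-line clients.
# The list is illustrative, not exhaustive.
AUTOMATION_MARKERS = ("curl", "wget", "python-requests", "scrapy")


def looks_automated(headers):
    """Return True if the request headers look like an unconfigured scraper."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    user_agent = h.get("user-agent", "").lower()
    if not user_agent:
        return True  # browsers always send a User-Agent
    if any(marker in user_agent for marker in AUTOMATION_MARKERS):
        return True
    # Browsers normally send Accept-Language; bare HTTP clients usually do not.
    return "accept-language" not in h


if __name__ == "__main__":
    print(looks_automated({"User-Agent": "curl/8.4.0", "Accept": "*/*"}))
```

This only catches the "unless carefully configured" case: every header here is supplied by the client, so a scraper that copies a real browser's headers passes the check.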

  • 2020-12-25 14:36

    What about something akin to a bulletin board's troll protection? If scraping is detected (say, a certain number of requests per minute from one IP, or a directed crawl that looks like a sitemap crawl), you can start serving garbage data, such as changing a couple of digits of a phone number or putting silly names in name fields.

    Turn this off for Google's IPs!
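One way this could look, sketched in Python. The class and function names, the thresholds and the allowlist mechanism are all assumptions for illustration, and the IP addresses in the example are documentation-range placeholders. A real allowlist for search crawlers would need proper verification of the crawler, which is out of scope here.

```python
import random
import time
from collections import defaultdict, deque


class ScrapeDetector:
    """Flags an IP that makes more than `max_requests` in a sliding time window."""

    def __init__(self, max_requests=60, window_seconds=60.0, allowlist=()):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.allowlist = set(allowlist)  # e.g. verified search-engine crawlers
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspect(self, ip, now=None):
        if ip in self.allowlist:
            return False
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


def poison_phone(phone, rng, changes=2):
    """Return `phone` with up to `changes` digits replaced by different digits."""
    chars = list(phone)
    digit_positions = [i for i, c in enumerate(chars) if c.isdigit()]
    for i in rng.sample(digit_positions, min(changes, len(digit_positions))):
        chars[i] = rng.choice([d for d in "0123456789" if d != chars[i]])
    return "".join(chars)


if __name__ == "__main__":
    detector = ScrapeDetector(max_requests=3, allowlist={"192.0.2.10"})
    phone = "555-0100"
    for _ in range(5):
        suspect = detector.is_suspect("203.0.113.5")
        print(poison_phone(phone, random.Random()) if suspect else phone)
```

Keying on the IP alone means users behind a shared address can trip the limit together, and a scraper spread across many addresses will not trip it at all, so the thresholds need tuning against real traffic.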

  • 2020-12-25 14:39

    If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.

    You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.

    Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.

  • 2020-12-25 14:39

    Take your hands away from the keyboard and ask your client why he wants the data to be visible but not scrapable.

    He's asking for two incongruent things and maybe having a discussion as to his reasoning will yield some fruit.

    It may be that he really doesn't want it publicly accessible and you need to add authentication / authorization. Or he may decide that there is value in actually opening up an API. But you won't know until you ask.
