I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy Artist names here and there and then do google searches for them).
How can I prevent screen scraping? Is it even possible?
Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work, and , by extension, what prevents them from working well.
There's various types of scraper, and each works differently:
Spiders, such as Google's bot or website copiers like HTtrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with a HTML parser to extract the desired data from each page.
Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in HTML, usually ignoring everything else.
For example: If your website has a search feature, such a scraper might submit a request for a search, and then get all the result links and their titles from the results page HTML, in order to specifically get only search result links and their titles. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using a HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Webscraping services such as ScrapingHub or Kimono. In fact, there's people whose job is to figure out how to scrape your site and pull out the content for others to use.
Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.
Embedding your website in other site's pages with frames, and embedding your site in mobile apps.
While not technically scraping, mobile apps (Android and iOS) can embed websites, and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy - paste: People will copy and paste your content in order to use it elsewhere.
There is a lot overlap between these different kinds of scraper, and many scrapers will behave similarly, even if they use different technologies and methods.
These tips mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the interwebs.
How to stop scraping
You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Monitor your logs & traffic patterns; limit access if you see unusual activity:
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas:
Rate limiting:
Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers, and make them ineffective. You could also show a captcha if actions are completed too fast or faster than a real user would.
Detect unusual activity:
If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
Don't just monitor & rate limit by IP address - use other indicators too:
If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
How fast users fill out forms, and where on a button they click;
You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.
HTTP headers and their order, especially User-Agent.
As an example, if you get many request from a single IP address, all using the same User Agent, screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper; and you can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, eg. in case of a shared internet connection.
You can also take this further, as you can identify similar requests, even if they come from different IP addresses, indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block. Again, be aware of not inadvertently blocking real users.
This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
Related questions on Security Stack Exchange:
How to uniquely identify users with the same external IP address? for more details, and
Why do people use IP address bans when IP addresses often change? for info on the limits of these methods.
Instead of temporarily blocking access, use a Captcha:
The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time, however using a Captcha may be better, see the section on Captchas further down.
Require registration & login
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.
- If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.
In order to avoid scripts creating many accounts, you should:
Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration / account creation.
Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.
Block access from cloud hosting and scraping service IP addresses
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.
Make your error message nondescript if you do block
If you do block / limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:
Too many requests from your IP address, please try again later.
Error, User Agent header not present !
Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:
- Sorry, something went wrong. You can contact support via
helpdesk@example.com
, should the problem persist.
This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block and thus cause legitimate users to contact you.
Use Captchas if you suspect that your website is being accessed by a scraper.
Captchas ("Completely Automated Test to Tell Computers and Humans apart") are very effective against stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using Captchas:
Don't roll your own, use something like Google's reCaptcha : It's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.
Serve your text content as an image
You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it's also easy to circumvent with some OCR, so don't do it.
You can do something similar with CSS sprites, but that suffers from the same problems.
Don't expose your complete dataset:
If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if:
- The bot / script does not want / need the full dataset anyway.
- Your articles are served from a URL which looks something like
example.com/article.php?articleId=12345
. This (and similar things) which will allow scrapers to simply iterate over all thearticleId
s and request all the articles that way. - There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
- Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).
- You need search engines to find your content.
Don't expose your APIs, endpoints, and similar things:
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data it is trivial to look at the network requests from the page and figure out where those requests are going to, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.
To deter HTML parsers and scrapers:
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in oder to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.
Frequently change your HTML
Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div
with an id of article-content
, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content
div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
You can frequently change the id's and classes of elements in your HTML, perhaps even automatically. So, if your
div.article-content
becomes something likediv.a4c36dda13eaf0
, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will usediv.[any-14-characters]
to find the desired div instead. Beware of other similar holes too..If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every
div
inside adiv
which comes after ah1
is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, eg. adding extradiv
s orspan
s. With modern server side HTML processing, this should not be too hard.
Things to be aware of:
It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.
See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.
Change your HTML based on the user's location
This is sort of similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
Frequently change your HTML, actively screw with the scrapers by doing so !
An example: You have a search feature on your website, located at example.com/search?query=somesearchquery
, which returns the following HTML:
<div class="search-result">
<h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class"search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)
As you may have guessed this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:
<div class="the-real-search-result">
<h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class"the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>
<div class="search-result" style="display:none">
<h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
<p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
<a class"search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)
This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they're hidden with CSS.
Screw with the scraper: Insert fake, invisible honeypot data into your page
Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:
<div class="search-result" style="display:none">
<h3 class="search-result-title">This search result is here to prevent scraping</h3>
<p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
Note that clicking the link below will block access to this site for 24 hours.</p>
<a class"search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)
A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either because you disallowed /scrapertrap/
in your robots.txt.
You can make your scrapertrap.php
do something like block access for the IP address that visited it or force a captcha for all subsequent requests from that IP.
Don't forget to disallow your honeypot (
/scrapertrap/
) in your robots.txt file so that search engine bots don't fall into it.You can / should combine this with the previous tip of changing your HTML frequently.
Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. Also want to consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a
style
attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
Serve fake and useless data if you detect a scraper
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.
As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.
Don't accept requests if the User Agent is empty / missing
Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)
It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers
In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:
- "Mozilla" (Just that, nothing else. I've seen a few questions about scraping here, using that. A real browser will never use only that)
- "Java 1.7.43_u43" (By default, Java's HttpUrlConnection uses something like this.)
- "BIZCO EasyScraping Studio 2.0"
- "wget", "curl", "libcurl",.. (Wget and cURL are sometimes used for basic scraping)
If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
If it doesn't request assets (CSS, images), it's not a real browser.
A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't as they are only interested in the actual pages and their content.
You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.
Use and require cookies; use them to track user and scraper actions.
You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers, however it is easy to for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.
For example: when the user performs search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.
Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.
Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve and send the cookie with their request.
Use JavaScript + Ajax to load your content
You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.
Be aware of:
Using JavaScript to load the actual content will degrade user experience and performance
Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.
Obfuscate your markup, network requests from scripts, and everything else.
If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64 or more complex), and then decode and display it on the client, after fetching via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.
If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.
You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using a HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
There are several disadvantages to doing something like this, though:
It will be tedious and difficult to implement, maintain, and debug.
It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript though)
It will make your site nonfunctional for real users if they have JavaScript disabled.
Performance and page-load times will suffer.
Non-Technical:
Tell people not to scrape, and some will respect it
Find a lawyer
Make your data available, provide an API:
You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.
Miscellaneous:
There are also commercial scraping protection services, such as the anti-scraping by Cloudflare or Distill Networks (Details on how it works here), which do these things, and more for you.
Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, find compromises.
Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.
Further reading:
Wikipedia's article on Web scraping. Many details on the technologies involved and the different types of web scraper.
Stopping scripters from slamming your website hundreds of times a second. Q & A on a very similar problem - bots checking a website and buying things as soon as they go on sale. A lot of relevant info, esp. on Captchas and rate-limiting.
I will presume that you have set up robots.txt
.
As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.
I would consider:
- Set up a page,
/jail.html
. - Disallow access to the page in
robots.txt
(so the respectful spiders will never visit). - Place a link on one of your pages, hiding it with CSS (
display: none
). - Record IP addresses of visitors to
/jail.html
.
This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt
.
You might also want to make your /jail.html
a whole entire website that has the same, exact markup as normal pages, but with fake data (/jail/album/63ajdka
, /jail/track/3aads8
, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.
Sue 'em.
Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You could really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease & desist or its equivalent in your country. You may be able to at least scare the bastards.
Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.
It would be a shame if this would drive you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out.)
Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money with is something that you should be able to fight against.
There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc. and appear as a normal user. The only thing you can do is make the text not available at the time the page is loaded - make it with image, flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.
If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.
Provide an XML API to access your data; in a manner that is simple to use. If people want your data, they'll get it, you might as well go all out.
This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
Then all you have to do is convince the people who want your data to use the API. ;)
Sorry, it's really quite hard to do this...
I would suggest that you politely ask them to not use your content (if your content is copyrighted).
If it is and they don't take it down, then you can take furthur action and send them a cease and desist letter.
Generally, whatever you do to prevent scraping will probably end up with a more negative effect, e.g. accessibility, bots/spiders, etc.
Okay, as all posts say, if you want to make it search engine-friendly then bots can scrape for sure.
But you can still do a few things, and it may be affective for 60-70 % scraping bots.
Make a checker script like below.
If a particular IP address is visiting very fast then after a few visits (5-10) put its IP address + browser information in a file or database.
The next step
(This would be a background process and running all time or scheduled after a few minutes.) Make one another script that will keep on checking those suspicious IP addresses.
Case 1. If the user Agent is of a known search engine like Google, Bing, Yahoo (you can find more information on user agents by googling it). Then you must see http://www.iplists.com/. This list and try to match patterns. And if it seems like a faked user-agent then ask to fill in a CAPTCHA on the next visit. (You need to research a bit more on bots IP addresses. I know this is achievable and also try whois of the IP address. It can be helpful.)
Case 2. No user agent of a search bot: Simply ask to fill in a CAPTCHA on the next visit.
Late answer - and also this answer probably isn't the one you want to hear...
Myself already wrote many (many tens) of different specialized data-mining scrapers. (just because I like the "open data" philosophy).
Here are already many advices in other answers - now i will play the devil's advocate role and will extend and/or correct their effectiveness.
First:
- if someone really wants your data
- you can't effectively (technically) hide your data
- if the data should be publicly accessible to your "regular users"
Trying to use some technical barriers aren't worth the troubles, caused:
- to your regular users by worsening their user-experience
- to regular and welcomed bots (search engines)
- etc...
Plain HMTL - the easiest way is parse the plain HTML pages, with well defined structure and css classes. E.g. it is enough to inspect element with Firebug, and use the right Xpaths, and/or CSS path in my scraper.
You could generate the HTML structure dynamically and also, you can generate dynamically the CSS class-names (and the CSS itself too) (e.g. by using some random class names) - but
- you want to present the informations to your regular users in consistent way
- e.g. again - it is enough to analyze the page structure once more to setup the scraper.
- and it can be done automatically by analyzing some "already known content"
- once someone already knows (by earlier scrape), e.g.:
- what contains the informations about "phil collins"
- enough display the "phil collins" page and (automatically) analyze how the page is structured "today" :)
You can't change the structure for every response, because your regular users will hate you. Also, this will cause more troubles for you (maintenance) not for the scraper. The XPath or CSS path is determinable by the scraping script automatically from the known content.
Ajax - little bit harder in the start, but many times speeds up the scraping process :) - why?
When analyzing the requests and responses, i just setup my own proxy server (written in perl) and my firefox is using it. Of course, because it is my own proxy - it is completely hidden - the target server see it as regular browser. (So, no X-Forwarded-for and such headers). Based on the proxy logs, mostly is possible to determine the "logic" of the ajax requests, e.g. i could skip most of the html scraping, and just use the well-structured ajax responses (mostly in JSON format).
So, the ajax doesn't helps much...
Some more complicated are pages which uses much packed javascript functions.
Here is possible to use two basic methods:
- unpack and understand the JS and create a scraper which follows the Javascript logic (the hard way)
- or (preferably using by myself) - just using Mozilla with Mozrepl for scrape. E.g. the real scraping is done in full featured javascript enabled browser, which is programmed to clicking to the right elements and just grabbing the "decoded" responses directly from the browser window.
Such scraping is slow (the scraping is done as in regular browser), but it is
- very easy to setup and use
- and it is nearly impossible to counter it :)
- and the "slowness" is needed anyway to counter the "blocking the rapid same IP based requests"
The User-Agent based filtering doesn't helps at all. Any serious data-miner will set it to some correct one in his scraper.
Require Login - doesn't helps. The simplest way beat it (without any analyze and/or scripting the login-protocol) is just logging into the site as regular user, using Mozilla and after just run the Mozrepl based scraper...
Remember, the require login helps for anonymous bots, but doesn't helps against someone who want scrape your data. He just register himself to your site as regular user.
Using frames isn't very effective also. This is used by many live movie services and it not very hard to beat. The frames are simply another one HTML/Javascript pages what are needed to analyze... If the data worth the troubles - the data-miner will do the required analyze.
IP-based limiting isn't effective at all - here are too many public proxy servers and also here is the TOR... :) It doesn't slows down the scraping (for someone who really wants your data).
Very hard is scrape data hidden in images. (e.g. simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times - but honestly - the data must worth the troubles for the scraper. (which many times doesn't worth).
On the other side, your users will hate you for this. Myself, (even when not scraping) hate websites which doesn't allows copy the page content into the clipboard (because the information are in the images, or (the silly ones) trying to bond to the right click some custom Javascript event. :)
The hardest are the sites which using java applets or flash, and the applet uses secure https requests itself internally. But think twice - how happy will be your iPhone users... ;). Therefore, currently very few sites using them. Myself, blocking all flash content in my browser (in regular browsing sessions) - and never using sites which depends on Flash.
Your milestones could be..., so you can try this method - just remember - you will probably loose some of your users. Also remember, some SWF files are decompilable. ;)
Captcha (the good ones - like reCaptcha) helps a lot - but your users will hate you... - just imagine, how your users will love you when they need solve some captchas in all pages showing informations about the music artists.
Probably don't need to continue - you already got into the picture.
Now what you should do:
Remember: It is nearly impossible to hide your data, if you on the other side want publish them (in friendly way) to your regular users.
So,
- make your data easily accessible - by some API
- this allows the easy data access
- e.g. offload your server from scraping - good for you
- setup the right usage rights (e.g. for example must cite the source)
- remember, many data isn't copyright-able - and hard to protect them
- add some fake data (as you already done) and use legal tools
- as others already said, send an "cease and desist letter"
- other legal actions (sue and like) probably is too costly and hard to win (especially against non US sites)
Think twice before you will try to use some technical barriers.
Rather as trying block the data-miners, just add more efforts to your website usability. Your user will love you. The time (&energy) invested into technical barriers usually aren't worth - better to spend the time to make even better website...
Also, data-thieves aren't like normal thieves.
If you buy an inexpensive home alarm and add an warning "this house is connected to the police" - many thieves will not even try to break into. Because one wrong move by him - and he going to jail...
So, you investing only few bucks, but the thief investing and risk much.
But the data-thief hasn't such risks. just the opposite - ff you make one wrong move (e.g. if you introduce some BUG as a result of technical barriers), you will loose your users. If the the scraping bot will not work for the first time, nothing happens - the data-miner just will try another approach and/or will debug the script.
In this case, you need invest much more - and the scraper investing much less.
Just think where you want invest your time & energy...
Ps: english isn't my native - so forgive my broken english...
From a tech perspective: Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.
From a legal perspective: It sounds like the data you're publishing is not proprietary. Meaning you're publishing names and stats and other information that cannot be copyrighted.
If this is the case, the scrapers are not violating copyright by redistributing your information about artist name etc. However, they may be violating copyright when they load your site into memory because your site contains elements that are copyrightable (like layout etc).
I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.
But, assuming you're publishing public domain information that's not copyrightable like names and basic stats... you should just let it go in the name of free speech and open data. That is, what the web's all about.
Things that might work against beginner scrapers:
- IP blocking
- use lots of ajax
- check referer request header
- require login
Things that will help in general:
- change your layout every week
- robots.txt
Things that will help but will make your users hate you:
- captcha
I have done a lot of web scraping and summarized some techniques to stop web scrapers on my blog based on what I find annoying.
It is a tradeoff between your users and scrapers. If you limit IP's, use CAPTCHA's, require login, etc, you make like difficult for the scrapers. But this may also drive away your genuine users.
Your best option is unfortunately fairly manual: Look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.
Since you're talking about a public site then making the site search-engine friendly will also make the site scraping-friendly. If a search-engine can crawl and scrape your site then an malicious scraper can as well. It's a fine-line to walk.
Sure it's possible. For 100% success, take your site offline.
In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).
You can do things like require several seconds between the first connection to your site, and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.
I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.
There are a few things you can do to try and prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexes.
However, I assume that if you don't want it scraped that means you don't want search engines to index it either.
Here are some things you can try:
- Show the text in an image. This is quite reliable, and is less of a pain on the user than a CAPTCHA, but means they won't be able to cut and paste and it won't scale prettily or be accessible.
- Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
- Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally - a screen-scraper might set up an account and might cleverly program their script to log in for them.
- If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
- You can set up a black list of known screen scraper user-agent strings as you discover them. Again, this will only help the lazily-coded ones; a programmer who knows what he's doing can set a user-agent string to impersonate a web browser.
- Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.
If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them I guess.
- No, it's not possible to stop (in any way)
- Embrace it. Why not publish as RDFa and become super search engine friendly and encourage the re-use of data? People will thank you and provide credit where due (see musicbrainz as an example).
It is not the answer you probably want, but why hide what you're trying to make public?
Method One (Small Sites Only):
Serve encrypted / encoded data.
I Scape the web using python (urllib, requests, beautifulSoup etc...) and found many websites that serve encrypted / encoded data that is not decrypt-able in any programming language simply because the encryption method does not exist.
I achieved this in a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites) the response was always jumbled content.
Example of minimizing output in PHP (How to minify php page html output?):
<?php
function sanitize_output($buffer) {
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
);
$replace = array('>', '<', '\\1');
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
ob_start("sanitize_output");
?>
Method Two:
If you can't stop them screw them over serve fake / useless data as a response.
Method Three:
block common scraping user agents, you'll see this in major / large websites as it is impossible to scrape them with "python3.4" as you User-Agent.
Method Four:
Make sure all the user headers are valid, I sometimes provide as many headers as possible to make my scraper seem like an authentic user, some of them are not even true or valid like en-FU :).
Here is a list of some of the headers I commonly provide.
headers = {
"Requested-URI": "/example",
"Request-Method": "GET",
"Remote-IP-Address": "656.787.909.121",
"Remote-IP-Port": "69696",
"Protocol-version": "HTTP/1.1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip,deflate",
"Accept-Language": "en-FU,en;q=0.8",
"Cache-Control": "max-age=0",
"Connection": "keep-alive",
"Dnt": "1",
"Host": "http://example.com",
"Referer": "http://example.com",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
}
Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.
Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.
Quick approach to this would be to set a booby/bot trap.
Make a page that if it's opened a certain amount of times or even opened at all, will collect certain information like the IP and whatnot (you can also consider irregularities or patterns but this page shouldn't have to be opened at all).
Make a link to this in your page that is hidden with CSS display:none; or left:-9999px; positon:absolute; try to place it in places that are less unlikely to be ignored like where your content falls under and not your footer as sometimes bots can choose to forget about certain parts of a page.
In your robots.txt file set a whole bunch of disallow rules to pages you don't want friendly bots (LOL, like they have happy faces!) to gather information on and set this page as one of them.
Now, If a friendly bot comes through it should ignore that page. Right but that still isn't good enough. Make a couple more of these pages or somehow re-route a page to accept differnt names. and then place more disallow rules to these trap pages in your robots.txt file alongside pages you want ignored.
Collect the IP of these bots or anyone that enters into these pages, don't ban them but make a function to display noodled text in your content like random numbers, copyright notices, specific text strings, display scary pictures, basically anything to hinder your good content. You can also set links that point to a page which will take forever to load ie. in php you can use the sleep() function. This will fight the crawler back if it has some sort of detection to bypass pages that take way too long to load as some well written bots are set to process X amount of links at a time.
If you have made specific text strings/sentences why not go to your favorite search engine and search for them, it might show you where your content is ending up.
Anyway, if you think tactically and creatively this could be a good starting point. The best thing to do would be to learn how a bot works.
I'd also think about scambling some ID's or the way attributes on the page element are displayed:
<a class="someclass" href="../xyz/abc" rel="nofollow" title="sometitle">
that changes its form every time as some bots might be set to be looking for specific patterns in your pages or targeted elements.
<a title="sometitle" href="../xyz/abc" rel="nofollow" class="someclass">
id="p-12802" > id="p-00392"
You can't stop normal screen scraping. For better or worse, it's the nature of the web.
You can make it so no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache. I assume it wouldn't be too difficult to do in IIS as well.
One way would be to serve the content as XML attributes, URL encoded strings, preformatted text with HTML encoded JSON, or data URIs, then transform it to HTML on the client. Here are a few sites which do this:
Skechers: XML
<document filename="" height="" width="" title="SKECHERS" linkType="" linkUrl="" imageMap="" href="http://www.bobsfromskechers.com" alt="BOBS from Skechers" title="BOBS from Skechers" />
Chrome Web Store: JSON
<script type="text/javascript" src="https://apis.google.com/js/plusone.js">{"lang": "en", "parsetags": "explicit"}</script>
Bing News: data URL
<script type="text/javascript"> //<![CDATA[ (function() { var x;x=_ge('emb7'); if(x) { x.src='data:image/jpeg;base64,/*...*/'; } }() )
Protopage: URL Encoded Strings
unescape('Rolling%20Stone%20%3a%20Rock%20and%20Roll%20Daily')
TiddlyWiki : HTML Entities + preformatted JSON
<pre> {"tiddlers": { "GettingStarted": { "title": "GettingStarted", "text": "Welcome to TiddlyWiki, } } } </pre>
Amazon: Lazy Loading
amzn.copilot.jQuery=i;amzn.copilot.jQuery(document).ready(function(){d(b);f(c,function() {amzn.copilot.setup({serviceEndPoint:h.vipUrl,isContinuedSession:true})})})},f=function(i,h){var j=document.createElement("script");j.type="text/javascript";j.src=i;j.async=true;j.onload=h;a.appendChild(j)},d=function(h){var i=document.createElement("link");i.type="text/css";i.rel="stylesheet";i.href=h;a.appendChild(i)}})(); amzn.copilot.checkCoPilotSession({jsUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-js/cs-copilot-customer-js-min-1875890922._V1_.js', cssUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-css/cs-copilot-customer-css-min-2367001420._V1_.css', vipUrl : 'https://copilot.amazon.com'
XMLCalabash: Namespaced XML + Custom MIME type + Custom File extension
<p:declare-step type="pxp:zip"> <p:input port="source" sequence="true" primary="true"/> <p:input port="manifest"/> <p:output port="result"/> <p:option name="href" required="true" cx:type="xsd:anyURI"/> <p:option name="compression-method" cx:type="stored|deflated"/> <p:option name="compression-level" cx:type="smallest|fastest|default|huffman|none"/> <p:option name="command" select="'update'" cx:type="update|freshen|create|delete"/> </p:declare-step>
If you view source on any of the above, you see that scraping will simply return metadata and navigation.
Most have been already said, but have you considered the CloudFlare protection? I mean this:
Other companies probably do this too, CloudFlare is the only one I know.
I'm pretty sure that would complicate their work. I also once got IP banned automatically for 4 months when I tried to scrap data of a site protected by CloudFlare due to rate limit (I used simple AJAX request loop).
I agree with most of the posts above, and I'd like to add that the more search engine friendly your site is, the more scrape-able it would be. You could try do a couple of things that are very out there that make it harder for scrapers, but it might also affect your search-ability... It depends on how well you want your site to rank on search engines of course.
Putting your content behind a captcha would mean that robots would find it difficult to access your content. However, humans would be inconvenienced so that may be undesirable.
If you want to see a great example, check out http://www.bkstr.com/. They use a j/s algorithm to set a cookie, then reloads the page so it can use the cookie to validate that the request is being run within a browser. A desktop app built to scrape could definitely get by this, but it would stop most cURL type scraping.
Screen scrapers work by processing HTML. And if they are determined to get your data there is not much you can do technically because the human eyeball processes anything. Legally it's already been pointed out you may have some recourse though and that would be my recommendation.
However, you can hide the critical part of your data by using non-HTML-based presentation logic
- Generate a Flash file for each artist/album, etc.
- Generate an image for each artist content. Maybe just an image for the artist name, etc. would be enough. Do this by rendering the text onto a JPEG/PNG file on the server and linking to that image.
Bear in mind that this would probably affect your search rankings.
Generate the HTML, CSS and JavaScript. It is easier to write generators than parsers, so you could generate each served page differently. You can no longer use a cache or static content then.
来源:https://stackoverflow.com/questions/3161548/how-do-i-prevent-site-scraping