I'm looking into building a content site with possibly thousands of different entries, accessible by index and by search.
What are the measures I can take to prevent malicious crawlers from ripping my content? I wouldn't want to block legitimate crawlers altogether.
Between this:
What are the measures I can take to prevent malicious crawlers from ripping
and this:
I wouldn't want to block legitimate crawlers altogether.
you're asking for a lot. The fact is, if you try to block malicious scrapers, you're going to end up blocking all the "good" crawlers too.
Remember that if people want to scrape your content, they're going to put in far more manual effort than a search engine bot will... So get your priorities right. You have two choices:
You could try using Flash / Silverlight / Java to display all your page contents. That would probably stop most crawlers in their tracks.
Use human validation wherever possible, and consider building the site on a framework (e.g. MVC); site-ripping software is sometimes unable to rip that kind of page. Also check the User-Agent header; at the very least it will reduce the number of possible rippers.
I used to have a system that would block or allow requests based on the User-Agent header. It relies on the crawler setting a User-Agent, but most of them seem to.
It won't work if they send a fake header to mimic a popular browser, of course.
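The check itself is simple. A minimal sketch of that kind of User-Agent filter, assuming hypothetical allow/block lists (real deployments would maintain far larger ones):

```python
import re

# Hypothetical lists -- illustrative only, not exhaustive.
ALLOWED_BOTS = [r"Googlebot", r"Bingbot"]                       # "good" crawlers
BLOCKED_AGENTS = [r"HTTrack", r"WebCopier", r"wget", r"curl"]   # known rippers

def is_blocked(user_agent: str) -> bool:
    """Return True if the request should be rejected based on its User-Agent."""
    # Known good crawlers win over the blocklist.
    if any(re.search(p, user_agent, re.IGNORECASE) for p in ALLOWED_BOTS):
        return False
    return any(re.search(p, user_agent, re.IGNORECASE) for p in BLOCKED_AGENTS)
```

As noted above, this only works against honest clients: anything sending a browser-like User-Agent sails straight through.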
The only way to stop a site being machine ripped is to make the user prove that they are human.
You could make users perform a task that is easy for humans and hard for machines, e.g. a CAPTCHA. When a user first arrives at your site, present a CAPTCHA and only allow them to proceed once they have completed it. If the user starts moving from page to page too quickly, re-verify.
This is not 100% effective, and hackers are always trying to break CAPTCHAs.
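The re-verification step can be sketched as per-session timing state. This is an assumption-laden toy (the thresholds and the `SessionGate` name are invented for illustration), but it shows the idea of forcing a fresh CAPTCHA when page views arrive faster than a human plausibly clicks:

```python
import time

# Hypothetical thresholds -- tune to what counts as "too quick" for your site.
MIN_SECONDS_BETWEEN_PAGES = 2.0
SUSPICIOUS_FAST_HITS = 3

class SessionGate:
    """Tracks page-view timing for one session and decides when to re-verify."""

    def __init__(self):
        self.verified = False   # set to True after the user solves a CAPTCHA
        self.last_hit = None
        self.fast_hits = 0

    def record_hit(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_hit is not None and now - self.last_hit < MIN_SECONDS_BETWEEN_PAGES:
            self.fast_hits += 1
        else:
            self.fast_hits = 0  # a human-paced gap resets the counter
        self.last_hit = now
        if self.fast_hits >= SUSPICIOUS_FAST_HITS:
            self.verified = False  # force another CAPTCHA

    def needs_captcha(self):
        return not self.verified
```

A request handler would call `record_hit()` on every page view and redirect to the CAPTCHA page whenever `needs_captcha()` is true.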
Alternatively, you could slow down responses. You don't need to make pages take forever to load, but pick a speed that is reasonable for humans (and very slow for a machine). This only makes scraping your site take longer; it doesn't make it impossible.
OK. Out of ideas.
If you're making a public site, then it's very difficult. There are methods that involve server-side scripting to generate content, or the use of non-text formats (Flash, etc.), to reduce the likelihood of ripping.
But to be honest, if you consider your content to be so good, just password-protect it and remove it from the public arena.
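Password protection can be as simple as HTTP Basic Auth in front of the content. A hedged sketch of validating the `Authorization` header (the credentials here are placeholders; a real site would check against salted password hashes, not plaintext):

```python
import base64
import hmac

# Placeholder credentials for illustration -- never store plaintext in practice.
USERNAME, PASSWORD = "reader", "s3cret"

def authorized(auth_header: str) -> bool:
    """Validate an HTTP Basic 'Authorization' header value."""
    if not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[len("Basic "):]).decode()
        user, _, pwd = decoded.partition(":")
    except Exception:
        return False
    # compare_digest avoids timing side channels in the comparison.
    return hmac.compare_digest(user, USERNAME) and hmac.compare_digest(pwd, PASSWORD)
```

Anything behind such a check is invisible to crawlers, good and bad alike, which is exactly the trade-off being described.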
My opinion is that the whole point of the web is to propagate useful content to as many people as possible.