robots.txt

BOT/Spider Trap Ideas

Question: I have a client whose domain seems to be getting hit pretty hard by what appears to be a DDoS. In the logs it's normal-looking user agents with random IPs, but they're flipping through pages too fast to be human. They also don't appear to be requesting any images. I can't find any pattern, and my suspicion is that it's a fleet of Windows zombies. The client has had issues in the past with spam attacks; they even had to point MX at Postini to stop the 6.7 GB/day of junk server-side. I want to
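
One idea in this direction (a sketch only, not taken from the original thread): list a honeypot URL in robots.txt, link to it somewhere a human would never click, and harvest the IPs that request it anyway, since well-behaved crawlers obey the Disallow while rogue bots ignore it. The trap path and log location in this Python sketch are assumptions, and the log is assumed to be in Apache's combined format:

    # Sketch of a simple spider trap. Prerequisite: robots.txt contains
    #   Disallow: /bot-trap/
    # and the page is linked only via a link humans never follow.
    import re

    TRAP_PATH = "/bot-trap/"                       # hypothetical honeypot path
    LOG_FILE = "/var/log/apache2/access.log"       # hypothetical log location

    ip_pattern = re.compile(r"^(\S+) ")            # client IP is the first field

    offenders = set()
    with open(LOG_FILE) as log:
        for line in log:
            if TRAP_PATH in line:
                match = ip_pattern.match(line)
                if match:
                    offenders.add(match.group(1))

    # IPs that ignored robots.txt and hit the trap; candidates for a deny list.
    for ip in sorted(offenders):
        print(ip)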

Parsing robots.txt in Python

Question: I want to parse a robots.txt file in Python. I have explored robotParser and robotExclusionParser, but nothing really satisfies my criteria. I want to fetch all the disallowed and allowed URLs in a single shot rather than manually checking each URL to see whether it is allowed. Is there any library to do this? Answer 1: You can use the curl command to read the robots.txt file into a single string, split it on newlines, and check for Allow and Disallow lines. import os result = os.popen("curl https://fortune
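
For reference, a minimal sketch of that one-pass idea using only the Python standard library instead of shelling out to curl; the fortune.com URL is assumed from the truncated snippet above:

    import urllib.request

    def fetch_rules(robots_url):
        # Download robots.txt and collect every Allow / Disallow path in one pass.
        with urllib.request.urlopen(robots_url) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        allowed, disallowed = [], []
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()   # drop comments and whitespace
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "allow" and value:
                allowed.append(value)
            elif field == "disallow" and value:
                disallowed.append(value)
        return allowed, disallowed

    allowed, disallowed = fetch_rules("https://fortune.com/robots.txt")
    print("Allowed:", allowed)
    print("Disallowed:", disallowed)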

.htaccess and robots.txt rewrite url how to

Question: Here is my code for .htaccess: RewriteEngine on RewriteRule (.*) index.html Problem: when I visit mydomain.com/robots.txt, the page is again redirected to index.html. Required: if the URL contains robots.txt, then serve mydomain.com/robots.txt; otherwise redirect to index.html. Answer 1: Try this: RewriteEngine on RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule (.*) index.html Basically, those two RewriteCond tell Apache to rewrite the URL only if the requested file ( -f )
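
Assembled from the answer above, the full .htaccess would look roughly like this (comments added; assumes Apache with mod_rewrite enabled):

    RewriteEngine on
    # Skip the rewrite when the request maps to a real file or directory,
    # so /robots.txt (and other static files) are served untouched.
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule (.*) index.html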

Blocking bots by modifying htaccess

Question: I am trying to block a couple of bots via my .htaccess file. On Search Engine Watch it is recommended to use the below. I did block these bots in the robots.txt file, but they are ignoring it. Here is the code from Search Engine Watch: RewriteEngine on Options +FollowSymlinks RewriteBase / RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^Sogou RewriteRule ^.*$ - [F] My current .htaccess file is as below. How exactly would I modify my current .htaccess with the above
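
For reference, the same snippet laid out as it would appear in the file (exactly the rules quoted above, one per line, with comments added); where it sits relative to the rest of the existing, truncated .htaccess depends on what that file contains:

    RewriteEngine on
    Options +FollowSymlinks
    RewriteBase /
    # Return 403 Forbidden to user agents beginning with Baiduspider or Sogou
    RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sogou
    RewriteRule ^.*$ - [F]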

robots.txt: how to disallow subfolders of a dynamic folder

Question: I have URLs like these: /products/:product_id/deals/new /products/:product_id/deals/index I'd like to disallow the "deals" folder in my robots.txt file. [Edit] I'd like to disallow this folder for the Google, Yahoo, and Bing bots. Does anyone know if these bots support the wildcard character and so would support the following rule? Disallow: /products/*/deals Also... do you have any really good tutorial on robots.txt rules? As I didn't manage to find a "really" good one, I could use one... And one last
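
A sketch of how that wildcard rule could be stated per crawler. Googlebot, Slurp (Yahoo), and bingbot treat * as "any sequence of characters", although that is an extension to the original robots.txt standard and not every crawler honours it:

    User-agent: Googlebot
    Disallow: /products/*/deals

    User-agent: Slurp
    Disallow: /products/*/deals

    User-agent: bingbot
    Disallow: /products/*/deals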

Block Google robots for URLs containing a certain word

Question: My client has a load of pages which they don't want indexed by Google. They are all called http://example.com/page-xxx, so they are /page-123 or /page-2 or /page-25, etc. Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt? Would something like this work? Disallow: /page-* Thanks Answer 1: In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx". Did you mean to put "page" in your
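
Since robots.txt rules are prefix matches, the trailing * adds nothing; a sketch of a group aimed only at Google (keep in mind that Disallow stops crawling, which is not always enough to remove URLs Google already knows about from the index):

    User-agent: Googlebot
    Disallow: /page-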

robots.txt; What encoding?

Question: I am about to create a robots.txt file. I am using Notepad. How should I save the file? UTF-8, ANSI, or what? Also, should it be a capital R? And in the file, I am specifying a sitemap location. Should this be with a capital S? User-agent: * Sitemap: http://www.domain.se/sitemap.xml Thanks Answer 1: Since the file should consist of only ASCII characters, it normally doesn't matter if you save it as ANSI or UTF-8. However, you should choose ANSI if you have a choice, because when you save a file
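
A likely reason for preferring ANSI in that truncated answer is the byte-order mark Notepad can prepend to UTF-8 files, which some parsers then misread on the first line. A small Python sketch of writing the file programmatically so it stays plain ASCII with no BOM (the file name is all lowercase, robots.txt):

    # Write robots.txt as UTF-8 without a BOM (use "utf-8", not "utf-8-sig").
    # Since the content is pure ASCII, the bytes are identical to an ANSI/ASCII save.
    content = "User-agent: *\nSitemap: http://www.domain.se/sitemap.xml\n"

    with open("robots.txt", "w", encoding="utf-8", newline="\n") as f:
        f.write(content)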

Why does Google index this? [closed]

Question: Closed. This question is off-topic and is not currently accepting answers. Closed 9 years ago. On this webpage: http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044 there is this image: http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg Why is this image indexed if the robots.txt contains "Disallow: /sites/"??

Robots.txt Disallow Certain Folder Names

Question: I want to disallow robots from crawling any folder, at any position in the URL, with the name this-folder. Examples to disallow: http://mysite.com/this-folder/ http://mysite.com/houses/this-folder/ http://mysite.com/some-other/this-folder/ http://mysite.com/no-robots/this-folder/ This is my attempt: Disallow: /.*this-folder/ Will this work? Answer 1: Officially, globbing and regular expressions are not supported: http://www.robotstxt.org/robotstxt.html but apparently some search engines support
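
A sketch of the rule using the wildcard syntax that Googlebot, bingbot, and some other major crawlers accept (this is * wildcard matching, not the regex form in the attempt above, and it is not guaranteed to work for every crawler):

    User-agent: *
    Disallow: /*this-folder/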

Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

Question: Below is a sample robots.txt file to allow multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration purposes and will be different in a real robots.txt file. I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions, and I do not know which is the correct/proper method. Questions: (1) Can each user agent have its own crawl-delay? (I assume yes.) (2) Where do you put the crawl-delay
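
A sketch of one common layout, with each user-agent group carrying its own Crawl-delay line inside that group; the values are illustrative, and note that Googlebot ignores Crawl-delay (its crawl rate is controlled through Search Console instead):

    User-agent: bingbot
    Crawl-delay: 5
    Disallow:

    User-agent: Slurp
    Crawl-delay: 10
    Disallow:

    # Default group for all other crawlers
    User-agent: *
    Crawl-delay: 15
    Disallow: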