robots.txt

Can I allow indexing (by search engines) of restricted content without making it public?

允我心安 submitted on 2019-12-12 01:58:14
Question: I have a site with some restricted content. I want my site to appear in search results, but I do not want the content to become public. Is there a way to allow crawlers to crawl my site while preventing them from making it public? The closest solution I have found is Google First Click Free, but even that requires me to show the content on the first visit.

Answer 1: Why do you want to allow people to search for a page that they can't access when they click the link? It's technically possible to make …
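
The answer teaser cuts off before describing its technique. A minimal sketch of one approach it may be pointing at (purely an assumption) is to serve the restricted content only to requests that verify as Googlebot via a reverse-then-forward DNS lookup; the function name and fallback behaviour below are my own, and serving crawlers different content from ordinary visitors can count as cloaking under search-engine guidelines.

    <?php
    // Sketch (assumption): unlock the page only for a verified Googlebot request.
    function is_verified_googlebot(string $ip): bool {
        $host = gethostbyaddr($ip);                    // reverse DNS lookup
        if ($host === false || $host === $ip) {
            return false;                              // no PTR record
        }
        // Genuine Googlebot hosts end in googlebot.com or google.com
        if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
            return false;
        }
        return gethostbyname($host) === $ip;           // forward lookup must match
    }

    if (is_verified_googlebot($_SERVER['REMOTE_ADDR'])) {
        // render the full article so it can be indexed
    } else {
        // require login / show only a teaser to ordinary visitors
    }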

Robots.txt: disallow a folder's name, regardless of the depth at which it may show up

一笑奈何 submitted on 2019-12-11 13:10:32
Question: I have to disallow search engines from indexing our REST web service responses (it's a Sitecore website). All of them have the same name in the URL but show up at different levels of the server hierarchy, and I was wondering whether I can write a "catch-all" entry in our robots file or whether I am doomed to write an extensive list. Can I add something like Disallow: */ajax/* to catch all folders named "ajax" regardless of where they appear?

Answer 1: The robots.txt specification doesn't say anything about …
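
The teaser stops mid-sentence, but the gist is that the original specification has no wildcard support, while major crawlers such as Googlebot and Bingbot do honour *. A sketch under that assumption, matching an "ajax" segment at any depth plus the top level explicitly:

    User-agent: *
    # Wildcard-aware crawlers only: blocks /foo/ajax/... at any depth
    Disallow: /*/ajax/
    # Top-level /ajax/ needs its own rule, since the * above would have to
    # match an empty segment to cover it
    Disallow: /ajax/

Strict original-spec bots ignore the wildcard line, so for those an explicit list of path prefixes would still be needed.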

Common rule in robots.txt

。_饼干妹妹 submitted on 2019-12-11 11:02:19
Question: How can I disallow URLs like 1.html, 2.html, …, [0-9]+.html (in terms of a regexp) with robots.txt?

Answer 1: The original robots.txt specification doesn't support regexes/wildcards. However, you could block URLs like these:

    example.com/1.html
    example.com/2367123.html
    example.com/3
    example.com/4/foo
    example.com/5/1
    example.com/6/
    example.com/7.txt
    example.com/883
    example.com/9to5
    …

with:

    User-agent: *
    Disallow: /0
    Disallow: /1
    Disallow: /2
    Disallow: /3
    Disallow: /4
    Disallow: /5
    Disallow: /6
    …
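
Those prefix rules also block unrelated paths such as /4/foo or /9to5. For crawlers that understand the * and $ wildcards (Googlebot, for example), a tighter variant is possible; this is only a sketch, and it is still looser than the regex, since it would also match paths like /1foo.html:

    User-agent: Googlebot
    Disallow: /0*.html$
    Disallow: /1*.html$
    # ... repeat for the digits 2 through 8 ...
    Disallow: /9*.html$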

PHP file_exists() for URL/robots.txt returns false

馋奶兔 submitted on 2019-12-11 08:26:19
Question: I tried to use file_exists(URL/robots.txt) to see whether the file exists on randomly chosen websites, and I get a false response. How do I check whether the robots.txt file exists? I don't want to start the download before I check. Will using fopen() do the trick? Because it "returns a file pointer resource on success, or FALSE on error", I guess I can put something like:

    $f = @fopen($url, "r");
    if ($f) ...

My code: http://www1.macys.com/robots.txt maybe it's not there http://www.intend.ro/robots…
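
file_exists() relies on stat(), which the HTTP stream wrapper does not support, so it returns false for remote URLs regardless of whether the file exists. A minimal sketch of an alternative (the helper name and the plain 200 check are my own assumptions) that asks the server with a HEAD request instead of downloading the file:

    <?php
    // Sketch (assumption): check for a remote robots.txt by looking at the
    // HTTP status line of a HEAD request.
    function robots_txt_exists(string $baseUrl): bool {
        $context = stream_context_create(['http' => ['method' => 'HEAD']]);
        $headers = @get_headers(rtrim($baseUrl, '/') . '/robots.txt', false, $context);
        if ($headers === false) {
            return false;                      // DNS failure, timeout, etc.
        }
        // $headers[0] looks like "HTTP/1.1 200 OK"
        return strpos($headers[0], ' 200 ') !== false;
    }

    var_dump(robots_txt_exists('http://www1.macys.com'));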

Can I use robots.txt to block any directory tree that starts with numbers?

核能气质少年 submitted on 2019-12-11 04:10:02
Question: I'm not even sure whether this is the best way to handle this, but I made a temporary mistake with my rewrites and Google (possibly others) picked up on it; now it has those URLs indexed and keeps reporting errors. Basically, I'm generating URLs based on a variety of factors, one being the id of an article, which is automatically generated. These then redirect to the correct spot. I had first accidentally set up URLs like this:

    /2343/news/blahblahblah
    /7645/reviews/blahblahblah

Etc. This was a …
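
The teaser ends before describing a fix. Blocking crawling is possible with ten prefix rules (Disallow: /0 through Disallow: /9), but since the numeric prefix was a rewrite mistake, a 301 redirect lets search engines replace the bad URLs instead of merely hiding them. A sketch, assuming Apache with mod_rewrite and that the prefix should simply be dropped:

    # Sketch (assumption): permanently redirect the accidental numeric-prefix
    # URLs to the intended path.
    RewriteEngine On
    RewriteRule ^[0-9]+/(news|reviews)/(.*)$ /$1/$2 [R=301,L]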

Using “Disallow: /*?” in robots.txt file

一曲冷凌霜 submitted on 2019-12-11 00:51:40
Question: I used Disallow: /*? in the robots.txt file to disallow all pages that might contain a "?" in the URL. Is that syntax correct, or am I blocking other pages as well?

Answer 1: It depends on the bot. Bots that follow the original robots.txt specification don't give the * any special meaning. These bots would block any URL whose path starts with /*, directly followed by ?, e.g., http://example.com/*?foo. Some bots, including Googlebot, give the * character a special meaning. It typically …
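
For the wildcard-aware bots the answer mentions, the rule behaves as intended; a sketch of how the same idea can be refined with the $ anchor (an assumption, not taken from the original answer):

    User-agent: Googlebot
    # Blocks every URL containing a "?" anywhere after the host name
    Disallow: /*?
    # Would instead block only URLs that END with a "?":
    # Disallow: /*?$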

Prevent Googlebot from indexing file types in robots.txt and .htaccess

余生长醉 submitted on 2019-12-10 20:17:48
Question: There are many Stack Overflow questions on how to prevent Googlebot from indexing, for instance, txt files. There's this:

robots.txt:

    User-agent: Googlebot
    Disallow: /*.txt$

.htaccess:

    <Files ~ "\.txt$">
        Header set X-Robots-Tag "noindex, nofollow"
    </Files>

However, what is the syntax for both of these when trying to prevent two types of files from being indexed? In my case: txt and doc.

Answer 1: In your robots.txt file:

    User-agent: Googlebot
    Disallow: /*.txt$
    Disallow: /*.doc$

More details at …
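
The teaser covers only the robots.txt half. A sketch of the matching .htaccess rule, assuming Apache with mod_headers enabled and using FilesMatch so one block covers both file types:

    <FilesMatch "\.(txt|doc)$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>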

Restrict robot access for (specific) query string (parameter) values?

橙三吉。 submitted on 2019-12-10 19:30:12
Question: Using robots.txt, is it possible to restrict robot access for (specific) query string (parameter) values? I.e.:

    http://www.url.com/default.aspx        # allow
    http://www.url.com/default.aspx?id=6   # allow
    http://www.url.com/default.aspx?id=7   # disallow

Answer 1:

    User-agent: *
    Disallow: /default.aspx?id=7   # disallow
    Disallow: /default.aspx?id=9   # disallow
    Disallow: /default.aspx?id=33  # disallow
    etc...

You only need to specify the URLs that are disallowed. Everything else is allowed by default.

Answer 2: Can just the …
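
Worth noting: robots.txt matching is prefix-based, so Disallow: /default.aspx?id=7 also blocks id=70, id=789 and so on. For crawlers that support the $ anchor (Googlebot, for instance), a sketch of a stricter rule (an assumption, and it only matches when id=7 is the last thing in the URL):

    User-agent: Googlebot
    # $ anchors the match at the end of the URL, so id=70 stays allowed
    Disallow: /default.aspx?id=7$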

Duplicated content on Google: .htaccess or robots.txt? [closed]

我的未来我决定 submitted on 2019-12-10 18:45:01
Question: On my website I have the following category URL structure: /category.php?id=6 (id=6 is the "internet" category). My SEO-friendly URL looks like /category/6/internet/. The problem is that the page can be accessed in either form, and because of that I'm getting duplicate content on Google. So I'm wondering how I can fix …
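
The teaser cuts off before any answer. A common fix for this kind of duplication (my assumption, not taken from the original thread) is to keep both URLs reachable but declare one of them canonical in the page head, so Google consolidates the two; a 301 redirect from the query-string form to the friendly URL achieves the same thing more forcefully.

    <!-- Sketch (assumption): emitted by category.php for both URL forms -->
    <link rel="canonical" href="http://example.com/category/6/internet/">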

Where to put robots.txt in Tomcat 7?

一曲冷凌霜 submitted on 2019-12-10 16:29:11
Question: I'm using Tomcat 7 to host my application. I've used a ROOT.xml file under tomcat-home\conf\Catalina\localhost:

    <Context docBase="C:\Program Files\Apache Software Foundation\Tomcat 7.0\mywebapp\MyApplication" path="" reloadable="true" />

This loads my webapp in the root context. But now I'm confused as to where to put the robots.txt and sitemap.xml files. When I put them under C:\Program Files\Apache Software Foundation\Tomcat 7.0\mywebapp\MyApplication, they don't show up. I've also tried …
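
The teaser ends before an answer. One common cause (an assumption here, since the question is cut off) is that the application maps a framework servlet to "/", which swallows requests for static files placed in the docBase directory. A sketch of a web.xml fragment that hands those paths back to Tomcat's built-in default servlet:

    <!-- Sketch (assumption): let Tomcat's DefaultServlet serve these static
         files even when an application servlet is mapped to "/" -->
    <servlet-mapping>
        <servlet-name>default</servlet-name>
        <url-pattern>/robots.txt</url-pattern>
    </servlet-mapping>
    <servlet-mapping>
        <servlet-name>default</servlet-name>
        <url-pattern>/sitemap.xml</url-pattern>
    </servlet-mapping>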