robots.txt

Can I allow indexing (by search engines) of restricted content without making it public?

允我心安 submitted on 2019-12-12 01:58:14
Question: I have a site with some restricted content. I want my site to appear in search results, but I do not want the content to become public. Is there a way to allow crawlers to crawl my site while preventing them from making it public? The closest solution I have found is Google First Click Free, but even that requires me to show the content on the first visit.

Answer 1: Why do you want to allow people to search for a page that they can't access when they click the link? It's technically possible to make …
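
The answer teaser cuts off before describing its technique. A minimal sketch of one approach it may be pointing at (purely an assumption) is to serve the restricted content only to requests that verify as Googlebot via a reverse-then-forward DNS lookup; the function name and fallback behaviour below are my own, and serving crawlers different content from ordinary visitors can count as cloaking under search-engine guidelines.

    <?php
    // Sketch (assumption): unlock the page only for a verified Googlebot request.
    function is_verified_googlebot(string $ip): bool {
        $host = gethostbyaddr($ip);                    // reverse DNS lookup
        if ($host === false || $host === $ip) {
            return false;                              // no PTR record
        }
        // Genuine Googlebot hosts end in googlebot.com or google.com
        if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
            return false;
        }
        return gethostbyname($host) === $ip;           // forward lookup must match
    }

    if (is_verified_googlebot($_SERVER['REMOTE_ADDR'])) {
        // render the full article so it can be indexed
    } else {
        // require login / show only a teaser to ordinary visitors
    }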

Robots.txt: disallow a folder's name, regardless of the depth at which it may show up

一笑奈何 submitted on 2019-12-11 13:10:32
Question: I have to disallow search engines from indexing our REST web service responses (it's a Sitecore website). All of them have the same name in the URL but show up at different levels of the server hierarchy, and I was wondering whether I can write a "catch-all" entry in our robots file or whether I am doomed to write an extensive list. Can I add something like Disallow: */ajax/* to catch all folders named "ajax" regardless of where they appear?

Answer 1: The robots.txt specification doesn't say anything about …
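
The teaser stops mid-sentence, but the gist is that the original specification has no wildcard support, while major crawlers such as Googlebot and Bingbot do honour *. A sketch under that assumption, matching an "ajax" segment at any depth plus the top level explicitly:

    User-agent: *
    # Wildcard-aware crawlers only: blocks /foo/ajax/... at any depth
    Disallow: /*/ajax/
    # Top-level /ajax/ needs its own rule, since the * above would have to
    # match an empty segment to cover it
    Disallow: /ajax/

Strict original-spec bots ignore the wildcard line, so for those an explicit list of path prefixes would still be needed.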

Common rule in robots.txt

。_饼干妹妹 submitted on 2019-12-11 11:02:19
Question: How can I disallow URLs like 1.html, 2.html, …, [0-9]+.html (in terms of a regexp) with robots.txt?

Answer 1: The original robots.txt specification doesn't support regexes/wildcards. However, you could block URLs like these:

    example.com/1.html
    example.com/2367123.html
    example.com/3
    example.com/4/foo
    example.com/5/1
    example.com/6/
    example.com/7.txt
    example.com/883
    example.com/9to5
    …

with:

    User-agent: *
    Disallow: /0
    Disallow: /1
    Disallow: /2
    Disallow: /3
    Disallow: /4
    Disallow: /5
    Disallow: /6
    …
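
Those prefix rules also block unrelated paths such as /4/foo or /9to5. For crawlers that understand the * and $ wildcards (Googlebot, for example), a tighter variant is possible; this is only a sketch, and it is still looser than the regex, since it would also match paths like /1foo.html:

    User-agent: Googlebot
    Disallow: /0*.html$
    Disallow: /1*.html$
    # ... repeat for the digits 2 through 8 ...
    Disallow: /9*.html$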

PHP file_exists() for URL/robots.txt returns false

馋奶兔 submitted on 2019-12-11 08:26:19
Question: I tried to use file_exists(URL/robots.txt) to see whether the file exists on randomly chosen websites, and I get a false response. How do I check whether the robots.txt file exists? I don't want to start the download before I check. Will using fopen() do the trick? Because it "returns a file pointer resource on success, or FALSE on error", I guess I can put something like:

    $f = @fopen($url, "r");
    if ($f) ...

My code: http://www1.macys.com/robots.txt maybe it's not there http://www.intend.ro/robots…
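
file_exists() relies on stat(), which the HTTP stream wrapper does not support, so it returns false for remote URLs regardless of whether the file exists. A minimal sketch of an alternative (the helper name and the plain 200 check are my own assumptions) that asks the server with a HEAD request instead of downloading the file:

    <?php
    // Sketch (assumption): check for a remote robots.txt by looking at the
    // HTTP status line of a HEAD request.
    function robots_txt_exists(string $baseUrl): bool {
        $context = stream_context_create(['http' => ['method' => 'HEAD']]);
        $headers = @get_headers(rtrim($baseUrl, '/') . '/robots.txt', false, $context);
        if ($headers === false) {
            return false;                      // DNS failure, timeout, etc.
        }
        // $headers[0] looks like "HTTP/1.1 200 OK"
        return strpos($headers[0], ' 200 ') !== false;
    }

    var_dump(robots_txt_exists('http://www1.macys.com'));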

Can I use robots.txt to block any directory tree that starts with numbers?

核能气质少年 submitted on 2019-12-11 04:10:02
Question: I'm not even sure whether this is the best way to handle this, but I made a temporary mistake with my rewrites and Google (possibly others) picked up on it; now it has those URLs indexed and keeps reporting errors. Basically, I'm generating URLs based on a variety of factors, one being the id of an article, which is automatically generated. These then redirect to the correct spot. I had first accidentally set up URLs like this:

    /2343/news/blahblahblah
    /7645/reviews/blahblahblah

Etc. This was a …
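
The teaser ends before describing a fix. Blocking crawling is possible with ten prefix rules (Disallow: /0 through Disallow: /9), but since the numeric prefix was a rewrite mistake, a 301 redirect lets search engines replace the bad URLs instead of merely hiding them. A sketch, assuming Apache with mod_rewrite and that the prefix should simply be dropped:

    # Sketch (assumption): permanently redirect the accidental numeric-prefix
    # URLs to the intended path.
    RewriteEngine On
    RewriteRule ^[0-9]+/(news|reviews)/(.*)$ /$1/$2 [R=301,L]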

Using “Disallow: /*?” in robots.txt file

一曲冷凌霜 submitted on 2019-12-11 00:51:40
Question: I used Disallow: /*? in the robots.txt file to disallow all pages that might contain a "?" in the URL. Is that syntax correct, or am I blocking other pages as well?

Answer 1: It depends on the bot. Bots that follow the original robots.txt specification don't give the * any special meaning. These bots would block any URL whose path starts with /*, directly followed by ?, e.g., http://example.com/*?foo. Some bots, including Googlebot, give the * character a special meaning. It typically …
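
For the wildcard-aware bots the answer mentions, the rule behaves as intended; a sketch of how the same idea can be refined with the $ anchor (an assumption, not taken from the original answer):

    User-agent: Googlebot
    # Blocks every URL containing a "?" anywhere after the host name
    Disallow: /*?
    # Would instead block only URLs that END with a "?":
    # Disallow: /*?$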

Prevent Googlebot from indexing file types in robots.txt and .htaccess

余生长醉 submitted on 2019-12-10 20:17:48
Question: There are many Stack Overflow questions on how to prevent Googlebot from indexing, for instance, txt files. There's this:

robots.txt:

    User-agent: Googlebot
    Disallow: /*.txt$

.htaccess:

    <Files ~ "\.txt$">
        Header set X-Robots-Tag "noindex, nofollow"
    </Files>

However, what is the syntax for both of these when trying to prevent two types of files from being indexed? In my case: txt and doc.

Answer 1: In your robots.txt file:

    User-agent: Googlebot
    Disallow: /*.txt$
    Disallow: /*.doc$

More details at …
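
The teaser covers only the robots.txt half. A sketch of the matching .htaccess rule, assuming Apache with mod_headers enabled and using FilesMatch so one block covers both file types:

    <FilesMatch "\.(txt|doc)$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>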

Restrict robot access for (specific) query string (parameter) values?

橙三吉。 submitted on 2019-12-10 19:30:12
Question: Using robots.txt, is it possible to restrict robot access for (specific) query string (parameter) values? I.e.:

    http://www.url.com/default.aspx        # allow
    http://www.url.com/default.aspx?id=6   # allow
    http://www.url.com/default.aspx?id=7   # disallow

Answer 1:

    User-agent: *
    Disallow: /default.aspx?id=7   # disallow
    Disallow: /default.aspx?id=9   # disallow
    Disallow: /default.aspx?id=33  # disallow
    etc...

You only need to specify the URLs that are disallowed. Everything else is allowed by default.

Answer 2: Can just the …
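
Worth noting: robots.txt matching is prefix-based, so Disallow: /default.aspx?id=7 also blocks id=70, id=789 and so on. For crawlers that support the $ anchor (Googlebot, for instance), a sketch of a stricter rule (an assumption, and it only matches when id=7 is the last thing in the URL):

    User-agent: Googlebot
    # $ anchors the match at the end of the URL, so id=70 stays allowed
    Disallow: /default.aspx?id=7$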

Duplicated content on Google: .htaccess or robots.txt? [closed]

我的未来我决定 submitted on 2019-12-10 18:45:01
Question: On my website I have the following category URL structure: /category.php?id=6 (id=6 is the "internet" category). My SEO-friendly URL looks like /category/6/internet/. The problem is that the page can be accessed in either form, and because of that I'm getting duplicate content on Google. So I'm wondering how I can fix …
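
The teaser cuts off before any answer. A common fix for this kind of duplication (my assumption, not taken from the original thread) is to keep both URLs reachable but declare one of them canonical in the page head, so Google consolidates the two; a 301 redirect from the query-string form to the friendly URL achieves the same thing more forcefully.

    <!-- Sketch (assumption): emitted by category.php for both URL forms -->
    <link rel="canonical" href="http://example.com/category/6/internet/">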

Where to put robots.txt in Tomcat 7?

一曲冷凌霜 submitted on 2019-12-10 16:29:11
Question: I'm using Tomcat 7 to host my application. I've used a ROOT.xml file under tomcat-home\conf\Catalina\localhost:

    <Context docBase="C:\Program Files\Apache Software Foundation\Tomcat 7.0\mywebapp\MyApplication" path="" reloadable="true" />

This loads my webapp in the root context. But now I'm confused as to where to put the robots.txt and sitemap.xml files. When I put them under C:\Program Files\Apache Software Foundation\Tomcat 7.0\mywebapp\MyApplication, they don't show up. I've also tried …
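
The teaser ends before an answer. One common cause (an assumption here, since the question is cut off) is that the application maps a framework servlet to "/", which swallows requests for static files placed in the docBase directory. A sketch of a web.xml fragment that hands those paths back to Tomcat's built-in default servlet:

    <!-- Sketch (assumption): let Tomcat's DefaultServlet serve these static
         files even when an application servlet is mapped to "/" -->
    <servlet-mapping>
        <servlet-name>default</servlet-name>
        <url-pattern>/robots.txt</url-pattern>
    </servlet-mapping>
    <servlet-mapping>
        <servlet-name>default</servlet-name>
        <url-pattern>/sitemap.xml</url-pattern>
    </servlet-mapping>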