robots.txt

How to assign specific sitemaps to specific crawler bots in robots.txt?

时光毁灭记忆、已成空白 submitted on 2019-12-23 17:14:57
Question: Since some crawlers don't like the sitemap versions made for Google, I made different sitemaps. There is an option to put Sitemap: http://example.com/sitemap.xml into robots.txt. But is it possible to put it roughly like this:

User-agent: *
Sitemap: http://example.com/sitemap.xml

User-agent: googlebot
Sitemap: http://example.com/sitemap-for-google.xml

I couldn't find any resource on this topic, and robots.txt is not something I want to joke around with.

Answer 1: This is not possible in robots.txt.
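Sitemap lines in robots.txt are not tied to User-agent groups; every crawler that reads the file sees all of them. A sketch of what you can do instead, using the two sitemap URLs from the question (the empty Disallow group is just a placeholder meaning "crawl everything"), is to list both sitemaps and let each crawler use what it understands; Google-specific sitemaps can also be submitted directly through Search Console rather than robots.txt:

User-agent: *
Disallow:

Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/sitemap-for-google.xml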

Robots.txt to disallow everything and allow only specific parts of the site/pages. Is “allow” supported by crawlers like Ultraseek and FAST?

左心房为你撑大大i submitted on 2019-12-23 12:03:58
Question: I just wanted to know if it is possible to disallow the whole site for crawlers and allow only specific webpages or sections. Is "allow" supported by crawlers like FAST and Ultraseek?

Answer 1: There is an Allow directive, but there is no guarantee that a particular bot will support it (much like there is no guarantee a bot will even check your robots.txt to begin with). You could probably tell by examining your web logs whether or not specific bots were indexing only the parts of your website that you allow.
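As a rough illustration of the pattern being asked about (the paths /public/ and /about.html are made up for the example), a blanket Disallow combined with Allow exceptions looks like this; Google and Bing honour Allow, but older or niche crawlers may only apply the Disallow: / line:

User-agent: *
# Allow rules listed first, for parsers that stop at the first matching rule
Allow: /public/
Allow: /about.html
Disallow: /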

robots.txt: allow all except a few sub-directories

我与影子孤独终老i submitted on 2019-12-22 05:53:45
Question: I want my site to be indexed by search engines except for a few sub-directories. These are my robots.txt settings:

robots.txt in the root directory:

User-agent: *
Allow: /

Separate robots.txt in the sub-directory (to be excluded):

User-agent: *
Disallow: /

Is this the correct way, or will the root-directory rule override the sub-directory rule?

Answer 1: No, this is wrong. You can't have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host. If you want to exclude certain sub-directories, add Disallow rules for them to that single root robots.txt.
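A sketch of that single root robots.txt, with hypothetical directory names standing in for the sub-directories to be excluded:

User-agent: *
# Everything not matched by a Disallow rule is crawlable by default
Disallow: /private/
Disallow: /drafts/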

Can I use the “Host” directive in robots.txt?

Deadly submitted on 2019-12-22 04:37:09
Question: Searching for specific information on robots.txt, I stumbled upon a Yandex help page on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain:

User-Agent: *
Disallow: /dir/
Host: www.myhost.com

Also, the Wikipedia article states that Google too understands the Host directive, but there wasn't much (i.e., any) information there. At robotstxt.org, I didn't find anything on Host (or Crawl-delay, as stated on Wikipedia). Is it encouraged to use the Host directive?
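For context, and not from this thread itself: Host was a Yandex-specific extension, and crawlers that do not recognise a directive simply skip that line, so including it is generally harmless for other bots; Yandex has since deprecated Host in favour of 301 redirects to the preferred mirror. The Yandex-style placement looked roughly like this, with www.myhost.com taken from the question:

User-Agent: *
Disallow: /dir/

# Yandex-only; ignored by crawlers that do not support it
Host: www.myhost.com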

Googlebots Ignoring robots.txt? [closed]

那年仲夏 submitted on 2019-12-22 04:35:22
Question: I have a site with the following robots.txt in the root:

User-agent: *
Disabled: /

User-agent: Googlebot
Disabled: /

User-agent: Googlebot-Image
Disallow: /

And pages within this site are getting scanned by Googlebot all day long. Is there something wrong with my file or with Google?

Answer 1: It should be Disallow:, not Disabled:. There is no Disabled directive in robots.txt, so crawlers ignore those lines.
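To make the fix concrete: because the Disabled lines are invalid, the Googlebot group contains no rules Googlebot understands, so it is effectively unrestricted. A corrected version of the file from the question, keeping its apparent intent of blocking everything, would presumably be:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Image
Disallow: /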

Rendering plain text through PHP

久未见 submitted on 2019-12-22 04:03:48
Question: For some reason, I want to serve my robots.txt via a PHP script. I have set up Apache so that requests for the robots.txt file (in fact, all file requests) go to a single PHP script. The code I am using to render robots.txt is:

echo "User-agent: wget\n";
echo "Disallow: /\n";

However, the newlines are not being processed. How do I serve robots.txt correctly, so that search engines (or any client) see it properly? Do I have to send some special headers for txt files?

EDIT 1: Now I have the following code:
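The newlines are almost certainly present in the output already; the usual problem is that PHP responds with the default text/html content type, so a browser collapses the whitespace when rendering the page. A minimal sketch, assuming Apache is already routing the robots.txt request to this script:

<?php
// Tell clients this is plain text so newlines are displayed and parsed as-is.
header('Content-Type: text/plain; charset=utf-8');

echo "User-agent: wget\n";
echo "Disallow: /\n";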

How to disallow search pages from robots.txt

扶醉桌前 submitted on 2019-12-21 20:26:58
Question: I need to prevent search pages such as http://example.com/startup?page=2 from being indexed. I want http://example.com/startup to be indexed, but not http://example.com/startup?page=2, page 3, and so on. Also, "startup" can be random, e.g., http://example.com/XXXXX?page

Answer 1: Something like this works, as confirmed by the Google Webmaster Tools "test robots.txt" function:

User-Agent: *
Disallow: /startup?page=

For Disallow, the value of the field specifies a partial URL that is not to be visited. This can be a full path or a partial path; any URL that starts with this value will not be retrieved.
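For the second part of the question, where the path prefix is not fixed, the original robots.txt specification only supports prefix matching, but Google and Bing also support the * wildcard; a sketch for those crawlers:

User-Agent: *
# Blocks any URL containing "?page=", e.g. /startup?page=2 or /XXXXX?page=3,
# while leaving /startup itself (no query string) crawlable
Disallow: /*?page=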

Robots.txt, how to allow access only to domain root, and no deeper? [closed]

怎甘沉沦 submitted on 2019-12-21 07:16:14
Question: I want to allow crawlers to access my domain's root directory (i.e. the index.html file), but nothing deeper (i.e. no subdirectories). I do not want to have to list and deny every subdirectory individually within the robots.txt file. Currently I have the following, but I think it is blocking everything, …
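A commonly used approach for this, for crawlers such as Googlebot and Bingbot that understand both Allow and the $ end-of-URL anchor (neither is part of the original specification), is a sketch like the following:

User-agent: *
# Allow only the root URL itself; everything deeper is blocked
Allow: /$
Disallow: /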

How to configure robots.txt to allow everything?

僤鯓⒐⒋嵵緔 submitted on 2019-12-20 08:15:13
Question: My robots.txt in Google Webmaster Tools shows the following values:

User-agent: *
Allow: /

What does this mean? I don't have enough knowledge about it, so I'm looking for your help. I want to allow all robots to crawl my website; is this the right configuration?

Answer 1: That file will allow all crawlers access:

User-agent: *
Allow: /

This basically allows all user agents (the *) to reach all parts of the site (the /).

Answer 2: If you want to allow every bot to crawl everything, this is the best way to specify it.
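The second answer presumably continues with the form from the original robots.txt specification, where an empty Disallow value means nothing is blocked:

User-agent: *
Disallow: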

Robots.txt restriction of category URLs

别说谁变了你拦得住时间么 submitted on 2019-12-20 06:37:21
Question: I was unable to find information about my case. I want to restrict the following type of URL from being indexed:

website.com/video-title/video-title/

(my website produces such doubled URL copies of my video articles). Each video article starts with the word "video" at the beginning of its URL. So what I want to do is restrict all URLs of the form website.com/any-url/video-any-url. This way I will remove all the doubled copies. Could somebody help me?

Answer 1: This is not possible in the original robots.txt specification.
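Although the original specification only does prefix matching, Google and Bing support the * wildcard, so a sketch for those crawlers, matching any URL whose second path segment starts with "video", would be:

User-agent: *
# Blocks e.g. /some-title/video-some-title/ but not /video-some-title/ itself
Disallow: /*/video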