robots.txt

How can I tell Google not to crawl a set of URLs

Question: How do I stop Google from crawling certain URLs in my application? For example, I want Google to stop crawling all the URLs that start with http://www.myhost-test.com/. What should I add to my robots.txt?

Answer 1: The answer can be found directly here: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449. In short, you add a Disallow line with the URL path you want blocked.

Source: https://stackoverflow.com/questions/11542918/how-can-i-tell-google-not-to-crawl-a-set-of-urls
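Since the example covers every URL on the test host, the Disallow rule reduces to blocking the whole site. A minimal sketch, assuming the file is served at http://www.myhost-test.com/robots.txt (robots.txt only applies to the host it is served from, so this does not affect a separate production domain):

    User-agent: *
    Disallow: /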

Get Google to index links from JavaScript-generated content

Question: On my site I have a directory of things which is generated through jQuery AJAX calls, which subsequently create the HTML. To my knowledge, Google and other bots aren't aware of DOM changes after the page load, and won't index the directory. What I'd like to achieve is to serve the search bots a dedicated page which only contains the links to the things. Would adding a noscript tag to the directory page be a solution? (In the noscript section, I would link to a page which merely serves the links.)
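A sketch of the noscript approach the question describes; the file name and markup here are illustrative assumptions, not from the original post:

    <div id="directory"><!-- populated by jQuery AJAX calls after page load --></div>
    <noscript>
        <!-- non-JS clients, including simple crawlers, get a plain link instead -->
        <a href="/directory-links.html">Browse the full directory</a>
    </noscript>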

Hide web pages from search engine robots

Question: I need to hide all my site's pages from ALL the spider robots, except for the home page (www.site.com), which should be parsed by robots. Does anyone know how I can do that?

Answer 1: Add the following tag to all pages you do not want indexed:

    <meta name="robots" content="noindex" />

Or you can create robots.txt in your document root and put there something like:

    User-agent: *
    Allow: /$
    Disallow: /*

Source: https://stackoverflow.com/questions/12807657/hide-web-pages-to-the-search-engines-robots
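The robots.txt version relies on pattern-matching extensions ($ and *) that are not part of the original robots.txt standard, though major crawlers such as Googlebot and Bingbot support them. An annotated restatement of why it works:

    User-agent: *
    Allow: /$      # "$" anchors the end of the URL, so this matches only the bare "/"
    Disallow: /*   # matches every other path on the site

Crawlers that ignore these extensions may interpret the rules differently, which makes the meta tag the more portable option.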

Blocking folders in between allowed content

Question: I have a site with the following structure: http://www.example.com/folder1/folder2/folder3. I would like to disallow indexing in folder1 and folder2, but I would like the robots to index everything under folder3. Is there a way to do this with robots.txt? From what I have read, I think that everything inside a specified folder is disallowed. Would the following achieve my goal?

    User-agent: *
    Crawl-delay: 0
    Sitemap: <Sitemap url>
    Allow: /folder1/folder2/folder3
    Disallow: /folder1/folder2/
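The question is cut off here, but the quoted rules should behave as intended for major crawlers: under RFC 9309 (which Googlebot and Bingbot follow), the most specific, i.e. longest, matching rule wins, so the Allow line overrides the Disallow line for anything under folder3. A slightly tighter sketch that also keeps crawlers out of the rest of folder1:

    User-agent: *
    Allow: /folder1/folder2/folder3
    Disallow: /folder1/

Note that Allow is not part of the original 1994 robots.txt standard, so older or simpler crawlers may ignore it and skip folder3 as well.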

What are recommended directives for robots.txt in a Django application?

Question: Currently my Django project has the following structure:

    ./
    ../
    app1/
    app2/
    django_project
    manage.py
    media
    static
    secret_stuff

and my robots.txt looks something like this:

    User-agent: *
    Allow: /
    Sitemap: mysite.com/sitemaps.xml

I want to know the following: What are the recommended directives I should add to my robots.txt file, given that the Django documentation says nothing on this topic? How do I stop bots from reaching (indexing) the contents of secret_stuff and the mysite.com/admin/ directory?

    Disallow:
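The question is truncated at the Disallow line. A hedged sketch of what such rules could look like; the /secret_stuff/ path is an assumption, since robots.txt matches URL paths rather than filesystem directories, so use whichever URL prefix actually exposes that content (also note that the Sitemap directive should be an absolute URL):

    User-agent: *
    Disallow: /admin/
    Disallow: /secret_stuff/
    Sitemap: https://mysite.com/sitemaps.xml

Keep in mind that robots.txt only asks well-behaved crawlers not to fetch these URLs; anything genuinely secret should be protected by authentication, not merely listed in a public file.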

How to hide website directory from search engines without Robots.txt?

Question: We know we can stop search engines from indexing directories on our site using robots.txt. But this of course has the disadvantage of actually publicising directories we don't want found to possible attackers. Password-protecting the directory using .htaccess or other means is obviously the best way to keep the directory private. But what if, for reasons of convenience, we didn't want to add another layer of security to the directory and just wanted to add another level of obfuscation? To
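The question is cut off here and no answer survives in this extract, but a common technique that fits the ask is an X-Robots-Tag response header: it tells crawlers not to index the content without the directory ever being named in a public robots.txt. A minimal sketch for an .htaccess file inside the hidden directory, assuming Apache with mod_headers enabled:

    # ask crawlers not to index or follow anything served from this directory
    Header set X-Robots-Tag "noindex, nofollow"

Unlike a robots.txt entry, this reveals nothing to someone who has not already found the directory, though it remains obfuscation rather than security.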

Robots.txt file in MVC.NET 4

Question: I have read an article about hiding some URLs from robots in my ASP.NET MVC 4 project. The author says we should add an action to one of the controllers; in his example he adds it to the Home controller:

    #region -- Robots() Method --
    public ActionResult Robots()
    {
        Response.ContentType = "text/plain";
        return View();
    }
    #endregion

Then we should add a Robots.cshtml file to our project with this body:

    @{ Layout = null; }
    # robots.txt for @this.Request.Url.Host
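The excerpt ends here. For the view to actually be served at /robots.txt, a route is typically needed as well; a sketch, assuming default MVC 4 routing and that the action lives on HomeController (register this before the default route, and note that depending on IIS settings, requests with a .txt extension may additionally need to be handed to managed code):

    routes.MapRoute(
        name: "Robots",
        url: "robots.txt",
        defaults: new { controller = "Home", action = "Robots" }
    );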

Robots.txt: allow only major SE

Question: Is there a way to configure robots.txt so that the site accepts visits ONLY from the Google, Yahoo! and MSN spiders?

Answer 1:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-Agent: msnbot
    Disallow:

Slurp is Yahoo's robot.

Answer 2: Why? Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt. So you're only going to be blocking legitimate search engines, since robots.txt compliance is voluntary. But if you insist on doing it anyway
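Two notes on answer 1: an empty Disallow: value means "disallow nothing", so the msnbot group is equivalent to Allow: / there. And each User-agent group stands alone; a crawler obeys the most specific group that names it and falls back to the * group only when nothing else matches, which is why the blanket Disallow: / does not leak into the three named groups.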

Why is my sitemap file considered empty?

Question: I have a robots.txt file in the root of my site which has this one line in it:

    Sitemap: http://www.awardwinnersonly.com/sitemap.xml

The sitemap.xml is also in the root of the site, and contains this text:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http:
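The question is cut off mid-entry and the answer is missing from this extract. For comparison, a minimal well-formed sitemap looks like this (the URL is illustrative); if a crawler reports the file as empty, common causes include a urlset with no url children, a server returning an error page or the wrong Content-Type for sitemap.xml, or malformed XML that stops the parser before the first entry:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.awardwinnersonly.com/</loc>
      </url>
    </urlset>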

Disallow PDF files from indexing (robots.txt)

Question: I have links being indexed that shouldn't be, and I need to remove them from Google. What should I enter into robots.txt? Example link: http://sitename.com/wp-content/uploads/2014/02/The-Complete-Program-2014.pdf

Answer 1: With robots.txt, you can disallow crawling, not indexing. With this robots.txt:

    User-agent: *
    Disallow: /wp-content/uploads/2014/02/The-Complete-Program-2014.pdf

any URL whose path starts with /wp-content/uploads/2014/02/The-Complete-Program-2014.pdf is not allowed to be crawled. But if a
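The answer breaks off here, but the crawling-versus-indexing distinction it draws points at the usual fix: to get a file dropped from the index, it must remain crawlable so the crawler can see a noindex signal, and since a PDF cannot carry a robots meta tag, that signal goes in an X-Robots-Tag response header. A sketch for Apache, assuming mod_headers is available (do not combine this with the Disallow rule above, or the header will never be fetched):

    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>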