How to stop search engines from crawling the whole website?

逝去的感伤 2021-02-05 10:05

I want to stop search engines from crawling my whole website.

I have a web application for members of a company to use. This is hosted on a web server so that the emplo

3 Answers
  • 2021-02-05 10:49

    Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.

    If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:

    Header set X-Robots-Tag noindex,nofollow
    

    This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):

    <meta name="robots" content="noindex,nofollow" />
    

    Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
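
To confirm that a page really carries one of the two signals described above, a check along these lines may help. It is only a minimal sketch: `is_noindexed` and `RobotsMetaParser` are hypothetical names, the headers and HTML are assumed to have been fetched already, and the parsing ignores the optional per-crawler prefix that the X-Robots-Tag header allows.

```python
from html.parser import HTMLParser


def _split(value):
    """Split a comma-separated directive list into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def is_noindexed(headers, html=""):
    """Return True if the header or the meta tag asks for noindex.

    `headers` is a plain dict of response headers (hypothetical input).
    """
    header_directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives |= _split(value)

    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

For example, `is_noindexed({"X-Robots-Tag": "noindex,nofollow"})` is true even for a response with no HTML body, which matches the point that the header also covers images and other non-HTML files.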

    In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.

  • 2021-02-05 10:57

    This is best handled with a robots.txt file, although it only works for bots that respect the file.

    To block the whole site add this to robots.txt in the root directory of your site:

    User-agent: *
    Disallow: /
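
One way to sanity-check these two lines before deploying them is Python's standard `urllib.robotparser` module. This is just a sketch: `blocked_by_robots` is a hypothetical helper, and the `intranet.example.com` URL is a placeholder, not anything from the question.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, as they would appear in /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def blocked_by_robots(robots_txt, user_agent, url):
    """Return True if a compliant crawler may NOT fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; substitute a real page on your own site.
    url = "https://intranet.example.com/members/"
    print(blocked_by_robots(ROBOTS_TXT, "Googlebot", url))
```

This only shows what a well-behaved crawler would conclude from the file; it says nothing about bots that ignore robots.txt.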
    

    To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.

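    Below are .htaccess rules that block everyone except requests coming from your company's IP address:

    # Apache 2.2 syntax; on Apache 2.4+ use "Require ip <address>" instead
    # Deny is evaluated first, then Allow, so the Allow line below wins
    Order deny,allow
    Deny from all
    # Enter your company's IP address here (the value below is a placeholder)
    Allow from 255.1.1.1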
    
  • 2021-02-05 11:08

    If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.

    That would mean that anyone (Google, a bot, a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.

    You could bake it into your website itself, or use HTTP Basic Authentication.

    https://www.httpwatch.com/httpgallery/authentication/
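
For a rough idea of what HTTP Basic Authentication involves on the wire, the sketch below builds and verifies the `Authorization` header in Python. The function names and the sample credentials are hypothetical; in practice the web server or framework normally does this for you, and Basic Authentication should only be used over HTTPS because the credentials are merely base64-encoded, not encrypted.

```python
import base64
import hmac


def basic_auth_header(username, password):
    """Build the Authorization header value a client would send."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def is_authorized(header_value, expected_user, expected_password):
    """Check an Authorization header against one expected credential pair."""
    if not header_value:
        return False
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(token.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode("utf-8"), expected_user.encode("utf-8"))
    pass_ok = hmac.compare_digest(password.encode("utf-8"), expected_password.encode("utf-8"))
    return user_ok and pass_ok
```

A request without a valid header would be answered with status 401 and a `WWW-Authenticate: Basic` response header, which is what makes the browser prompt for a username and password.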
