How do I write my robots.txt to exclude SSL pages only?

Discussion in 'robots.txt' started by amanamission, Sep 18, 2007.

  1. #1
    Okay, I have never written a robots.txt file before, because I'm scared of making a mistake.
    Now I need to do it and do it right.
    I discovered that some of my indexing problems are related to my SSL certificate, which generates duplicate copies of my pages that Google is confusing with my main page. A 301 redirect fixed this right up, but the problem is that my SSL certificate, and hence my entire ordering setup, no longer worked, since it was trying to access SSL URLs but was being redirected to http instead.
    Here's what I want to do: exclude ONLY the SSL pages from indexing.

    .sslpowered.mydomain.com/mydomain.com
    but I want all of my regular pages to still be crawled.

    How can I write this properly, and how can I verify that I did not exclude my http: pages?
     
    amanamission, Sep 18, 2007
  2. amanamission

    #2
    Okay, so this has turned out to be more complicated than I originally thought.

    I found a two-part solution involving mod_rewrite and two robots.txt files here:

    http://www.seoworkers.com/seo-articles-tutorials/robots-and-https.html

    In .htaccess:
    RewriteEngine on
    Options +FollowSymlinks
    # Requests arriving on the SSL port (443) get the disallow-all file instead
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ robots_ssl.txt [L]

    In robots.txt:
    User-agent: *
    Allow: /

    In robots_ssl.txt:
    User-agent: *
    Disallow: /

    So this ought to serve the disallow file to any robot that requests robots.txt over SSL/https, which is great, since I don't want those pages indexed, while still letting all robots crawl the http pages on my five domains.

    My only problem at this point is how to verify this. The regular robots.txt file is in my root directory, as is the SSL version, so I don't know if that's correct. There isn't really an address for it there; it's just www/robots.txt.
    Should I have this in the domain folder as well, or instead?
    Is anybody familiar with this who can put me on steadier ground?
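
    For the verification question, one option is a small offline check. This is only a minimal Python sketch, assuming the two files contain exactly the rules quoted above; is_crawlable, check_live, and the example.com URLs are made-up names for illustration, not something from the article.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def check_live(robots_url: str, page_url: str, user_agent: str = "*") -> bool:
    """Same check, but download robots.txt from robots_url first (needs network)."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, page_url)


# The two rule sets from the post: one served over http, one over https.
HTTP_ROBOTS = "User-agent: *\nAllow: /\n"
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"

# example.com is a placeholder host, not the poster's real domain.
print(is_crawlable(HTTP_ROBOTS, "http://www.example.com/order.html"))    # True
print(is_crawlable(HTTPS_ROBOTS, "https://www.example.com/order.html"))  # False
```

    Once the rewrite rule is live, calling check_live with the http:// and the https:// address of robots.txt should give the same True/False pair if the server is handing out the right file on each port.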
     
    amanamission, Sep 19, 2007
  3. jimkarter

    #3
    Do search engines request https://www.domain.com/robots.txt before downloading any https: URLs? If yes, it will work; otherwise it will not.

    Yes. It should be in the root folder of your https: URL.
     
    jimkarter, Sep 19, 2007
  4. amanamission

    #4
    Well, webmaster tools gave me the first answer: it returned a 404 for the robots.txt that wasn't in my domain directory, so I've added the Allow file to all my directories. I still haven't decided whether I need the SSL robots file in each directory as well, since its location is URL-accessible as is.

    @jimkarter: As far as I can tell, and according to this article, secure requests use a different port (443). That's what the mod_rewrite is for: a robot that asks for robots.txt over https is served the alternate robots_ssl.txt, with its Disallow rule, instead of the regular file.
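
    To make the port-based selection concrete, here is a rough Python sketch that mimics the decision the .htaccess rule makes. It is an illustration only, not how Apache implements it; file_served is a hypothetical helper, and the patterns are copied from the rule with the dot escaped.

```python
import re

# Python stand-ins for the two patterns in the .htaccess rule.
SSL_PORT = re.compile(r"^443$")              # RewriteCond %{SERVER_PORT} ^443$
ROBOTS_PATH = re.compile(r"^robots\.txt$")   # RewriteRule ^robots\.txt$ ...


def file_served(path: str, server_port: int) -> str:
    """Return the file that would answer a request for path on server_port.

    In a per-directory .htaccess the leading slash is stripped before
    matching, so path is given here as "robots.txt", not "/robots.txt".
    """
    if SSL_PORT.search(str(server_port)) and ROBOTS_PATH.search(path):
        return "robots_ssl.txt"
    return path


for port in (80, 443):
    print(port, file_served("robots.txt", port))
# 80 robots.txt
# 443 robots_ssl.txt
```

    Only the robots.txt request is swapped; any other path on port 443 comes back unchanged.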

    I'll let you know if it works. So far, it seems sound.
    Anybody friends with a Googlebot?
     
    amanamission, Sep 19, 2007
  5. evera

    #5
    If you have a separate https folder, you should use:
    RewriteCond %{HTTPS} on
    RewriteRule ^robots\.txt$ robots-https.txt [L]
     
    evera, Sep 20, 2007
  6. Webnauts

    #6
    If your pages are PHP, you could try adding the following to the <head> of your documents:

    <?php
    // Emit noindex on HTTPS requests; let crawlers index plain HTTP ones.
    if (isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) === 'on') {
        echo '<meta name="robots" content="noindex,nofollow,noarchive" />' . "\n";
    } else {
        echo '<meta name="robots" content="index,follow" />' . "\n";
    }
    ?>
     
    Webnauts, Dec 5, 2008