How to hide robots.txt from visitors

Discussion in 'robots.txt' started by latoya, Aug 23, 2008.

  1. #1
How can I keep my robots.txt file from being accessed by visitors?

I have a couple of pages hidden from search engines because I don't want them showing up in searches, where users could then access them.

Right now, anyone can go to mysite.com/robots.txt, see the path of the file, and then access it directly.

    I know I can move the files into a directory, then block the entire directory, but that just gives hackers another place to look.

So how can I keep robots.txt from being accessed by humans altogether?
     
    latoya, Aug 23, 2008 IP
  2. hues

    hues Member

    Messages:
    107
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    26
    #2
I don't have any idea; you might try asking on Yahoo.
     
    hues, Aug 24, 2008 IP
  3. goliathus

    goliathus Peon

    Messages:
    93
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #3
You can use a shortened pattern in robots.txt.

E.g. if you want to disallow /hide-this-page-for-visitors.html, use "Disallow: /hide-this". It will disallow /hide-this-page-for-visitors.html but also /hide-this-page-for-bots.html. Simply put, it blocks every file and folder whose path starts with the prefix "hide-this", so the full filenames never appear in robots.txt.

I hope that helps. If you want to disallow an admin page, I recommend password protection and you will have no problem. Or rename the admin folder to e.g. admin01sdsad5sd and use "Disallow: /admin" :) But I think password protection of the folder is better :)
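
    To make the prefix trick concrete, a robots.txt using it could look something like this (the filenames are the ones from the example above):

    ```
    User-agent: *
    # Matches /hide-this-page-for-visitors.html, /hide-this-page-for-bots.html,
    # and anything else whose path starts with /hide-this
    Disallow: /hide-this
    ```

    The point is that a visitor reading the file only learns the prefix, not the full filenames.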
     
    goliathus, Aug 28, 2008 IP
  4. Scorpiono

    Scorpiono Well-Known Member

    Messages:
    1,330
    Likes Received:
    35
    Best Answers:
    0
    Trophy Points:
    120
    #4
You can, but it's complex to maintain.
You can use .htaccess to return Forbidden for everyone except Googlebot's IP.
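
    A rough sketch of that idea in .htaccess (the address range below is just a placeholder for illustration; Googlebot crawls from many addresses that can change, which is exactly why this approach is hard to keep current):

    ```apache
    <Files robots.txt>
    Order Deny,Allow
    Deny from all
    # Placeholder range; real crawler IPs change and span many networks
    Allow from 66.249.64.0/19
    </Files>
    ```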
     
    Scorpiono, Aug 28, 2008 IP
  5. Ikki

    Ikki Peon

    Messages:
    474
    Likes Received:
    34
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Hi there,

    Put this code in your .htaccess file:
    <Files robots.txt>
    Order Deny,Allow
    Deny from All
    </Files>
     
    Ikki, Aug 28, 2008 IP
  6. goliathus

    goliathus Peon

    Messages:
    93
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #6
Scorpiono: Yup, but you have to keep checking whether the bot's IP is still the same, or you could be crawled by a different one (I don't think Googlebot uses just one IP). It's a waste of time: for it to work you'd have to check constantly, which would take several hours each month. Otherwise you could miss one of the bot's IPs and your secret pages would be exposed.

Ikki: I think deleting the file would be easier than using your code. Why? Because your code denies the file to ALL, including the bots, which is exactly what he does not want.

latoya: Another tip: add <meta name="robots" content="noindex" /> to the head tag of the pages you want excluded from the index. But the best approach is to use this together with robots.txt patterns, e.g. like I wrote in my previous post ...
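
    For example, the head of such a page would look something like this (the title is just a placeholder):

    ```html
    <head>
      <title>Hidden page</title>
      <!-- Tells compliant crawlers not to list this page in search results -->
      <meta name="robots" content="noindex" />
    </head>
    ```

    Note that a crawler only sees this tag if it is allowed to fetch the page in the first place.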
     
    goliathus, Aug 28, 2008 IP
  7. Ikki

    Ikki Peon

    Messages:
    474
    Likes Received:
    34
    Best Answers:
    0
    Trophy Points:
    0
    #7
Actually, you're right, goliathus. I forgot to add something to that code.

    Here it is, again:
    <Files robots.txt>
    Order Deny,Allow
    Deny from All
    Allow from googlebot.com google.com google-analytics.com
    </Files>
    (This would deny access to everyone but Google)
     
    Ikki, Aug 28, 2008 IP
  8. Ikki

    Ikki Peon

    Messages:
    474
    Likes Received:
    34
    Best Answers:
    0
    Trophy Points:
    0
    #8
BTW, you should also extend that code to allow access for other SE spiders like Yahoo! Slurp. Otherwise Google would be the only one able to read it.
     
    Ikki, Aug 28, 2008 IP
  9. goliathus

    goliathus Peon

    Messages:
    93
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #9
Are you sure Googlebot resolves to a hostname? I don't think so; I'm skeptical about this. I still think my way is the safest, because once Google starts a new bot that isn't on the allow list, your secret pages will be exposed: that bot will never see them as disallowed...
     
    goliathus, Aug 28, 2008 IP
  10. Ikki

    Ikki Peon

    Messages:
    474
    Likes Received:
    34
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Please read here: http://www.robotstxt.org/db/googlebot.html
Ok, let's say Google releases a new bot tomorrow called ICrawlSites. Since ICrawlSites is not on the "whitelist" (see the third line of the .htaccess code), it won't be granted access to robots.txt, and therefore won't see those hidden pages our friend latoya is trying to keep secret.
     
    Ikki, Aug 28, 2008 IP
  11. Bagi Zoltán

    Bagi Zoltán Well-Known Member

    Messages:
    364
    Likes Received:
    23
    Best Answers:
    0
    Trophy Points:
    110
    #11
You may find this resource useful, and it's not difficult to implement at all. :eek:
     
    Bagi Zoltán, Aug 28, 2008 IP
    latoya likes this.
  12. latoya

    latoya Active Member

    Messages:
    749
    Likes Received:
    73
    Best Answers:
    0
    Trophy Points:
    70
    #12
    Thanks everyone. I used the solution recommended by Bagi Zoltan. I'll watch my stats to make sure the bots are still able to crawl my site.
     
    latoya, Aug 31, 2008 IP
  13. justinlorder

    justinlorder Peon

    Messages:
    4,160
    Likes Received:
    61
    Best Answers:
    0
    Trophy Points:
    0
    #13
Is it really necessary to hide the robots.txt from users? Why?
     
    justinlorder, Aug 31, 2008 IP
  14. ajrenk

    ajrenk Peon

    Messages:
    11
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #14
    I had the same problem and Bagi's solution worked great! Thanks
     
    ajrenk, Sep 1, 2008 IP
  15. mdvaldosta

    mdvaldosta Peon

    Messages:
    4,079
    Likes Received:
    362
    Best Answers:
    0
    Trophy Points:
    0
    #15
A noindex meta tag on those pages would be the easiest way, or better yet, just password-protect those folders or pages.
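
    A minimal password-protection sketch for Apache, assuming the hidden pages live in one folder (the paths here are placeholders you'd adjust for your server):

    ```apache
    # .htaccess inside the protected folder
    AuthType Basic
    AuthName "Private area"
    # Password file created beforehand with: htpasswd -c /path/to/.htpasswd someuser
    AuthUserFile /path/to/.htpasswd
    Require valid-user
    ```

    This keeps both humans and bots out unless they have credentials, so nothing needs to be listed in robots.txt at all.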
     
    mdvaldosta, Sep 1, 2008 IP
  16. Kentucky.Star

    Kentucky.Star Peon

    Messages:
    46
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #16
Same thought here: why would you want to hide it? It's just a small text file, isn't it?
     
    Kentucky.Star, Sep 5, 2008 IP
  17. mrgroove

    mrgroove Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #17
Good question, and for most sites it's probably not needed. That being said, if you have a website with sensitive data, or a page or directory you want to make sure a search engine like Google never indexes, some people like to put a disallow for that folder or page.

    For instance, perhaps you have a site and there's a folder called:

    http://website.com/userdata/

Normally you need a password to get into that page/folder, but if you update your code, or you have a code bug, it's possible that a robot/search engine could get in, spider the entire folder, and now it's all on Google. So to protect against this "edge case", the webmaster might put Disallow: /userdata into his robots.txt file just to ensure Google never crawls it.

Well, now guess what: you've just told hackers that you have a subfolder called userdata on your site, and now they know that if they attack your site, they might be able to get into that folder and grab some user data. So this might be one reason why you would want to hide the robots.txt file, i.e. you don't want to give anything away.

A better option would be to use a noindex meta tag; however, that's a code update, and some people have their hands tied there, especially if the data in the "sensitive" folder is a blob of data (.pdf?) rather than a web page.
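
    For non-HTML files like PDFs, one way around that limitation (assuming Apache with mod_headers enabled) is the X-Robots-Tag response header, which works like the robots meta tag but doesn't require editing the files themselves:

    ```apache
    # Send a noindex header for every PDF; no robots.txt entry needed,
    # so nothing about the folder structure is revealed to visitors
    <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
    </FilesMatch>
    ```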
     
    mrgroove, Apr 1, 2010 IP