How can I keep my robots.txt file from being accessed by visitors? I have a couple of pages hidden from search engines because I don't want them included in search results, while still allowing users to access them. Right now, anyone can go to mysite.com/robots.txt, see the path of the file, and then access it directly. I know I can move the files into a directory and then block the entire directory, but that just gives hackers another place to look. So how can I keep robots.txt from being accessed by humans altogether?
You can use a shortened record in robots.txt. For example, if you want to disallow /hide-this-page-for-visitors.html, use "Disallow: /hide-this" instead. That will disallow /hide-this-page-for-visitors.html, but also /hide-this-page-for-bots.html; simply put, it disallows every file and folder in the path that begins with the prefix "hide-this" (see the sketch below). I hope that helps. If you want to disallow an admin page, I recommend password protection and you will have no problem. Or rename the admin folder to e.g. admin01sdsad5sd and use "Disallow: /admin", but I think password-protecting the folder is better.
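A minimal robots.txt sketch of that prefix trick, using the example file names from this thread:

    User-agent: *
    # Disallow matches by path prefix, so this single line covers both
    # /hide-this-page-for-visitors.html and /hide-this-page-for-bots.html
    # without spelling out the full, guessable file names.
    Disallow: /hide-this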
You can, but it's almost too complex to be worth trying. You can use .htaccess to return Forbidden for everyone except Googlebot's IP.
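A sketch of that IP-based approach, in the same Apache 2.2 Order/Deny/Allow style used elsewhere in this thread. The 66.249.64.0/19 range is one Google has published for Googlebot, but treat it as an example and verify it against Google's current list:

    <Files "robots.txt">
      Order Deny,Allow
      Deny from all
      # Example Googlebot range; Google can add or change ranges at any time.
      Allow from 66.249.64.0/19
    </Files>

The catch, as the posts below point out, is keeping that range list current.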
Hi there, put this code in your .htaccess file:

    <Files robots.txt>
      Order Deny,Allow
      Deny from All
    </Files>
Scorpiono: Yup, but you have to keep checking whether the bot's IP is still the same, or you may be crawled by a different one (I don't think the Google bot uses just one IP). It's a waste of time: for it to keep working you would have to re-check constantly, which would take you several hours each month. Otherwise you could miss one of the bot's IPs and your secret pages would end up online.

Ikki: I think deleting the file would be easier than using your code. Why? Because your code denies the file to ALL, which includes the bots, and that is exactly what he does not want.

latoya: Another tip: add <meta name="robots" content="noindex" /> to the head tag of the pages you want to exclude from the index. But the best approach is to use this together with robots.txt, e.g. like I wrote in my previous post.
Actually, you're right goliathus. I forgot to add something in that code. Here it is, again:

    <Files robots.txt>
      Order Deny,Allow
      Deny from All
      Allow from googlebot.com google.com google-analytics.com
    </Files>

(This would deny access to everyone but Google.)
BTW, you should also use that code to allow access to other SE spiders like Yahoo Slurp. Otherwise Google would be the only one able to find it.
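A sketch of what that extended allow list might look like; the Yahoo hostnames are assumptions for illustration, so check each engine's documentation for the hostnames its crawlers actually reverse-resolve to:

    <Files robots.txt>
      Order Deny,Allow
      Deny from All
      # Hostname entries make Apache do a double-reverse DNS lookup on
      # each client; add one entry per crawler you trust.
      Allow from googlebot.com google.com
      Allow from crawl.yahoo.net yahoo.com
    </Files>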
Are you sure the Google bot resolves to a hostname? I don't think so ... I'm skeptical about this ... I still think my way is the safest, because as soon as Google launches a new bot that isn't on the allow list, your secret pages will be online, since that bot won't see them as disallowed ...
Please read here: http://www.robotstxt.org/db/googlebot.html

OK, let's say Google releases tomorrow a new bot called ICrawlSites. Since ICrawlSites is not on the "whitelist" (the Allow line of the .htaccess code), it won't be granted access to robots.txt and therefore won't see those hidden pages our friend latoya is trying to keep secret.
Thanks everyone. I used the solution recommended by Bagi Zoltan. I'll watch my stats to make sure the bots are still able to crawl my site.
Noindex meta tag on those pages would be the easiest way, or better yet just password protecting those folders or pages.
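For the password-protection route, a minimal .htaccess Basic Auth sketch; the AuthUserFile path is a placeholder, and the .htpasswd file should live outside the web root and be created with the htpasswd tool:

    AuthType Basic
    AuthName "Private area"
    # Placeholder path: point this at an htpasswd file outside the web root.
    AuthUserFile /home/example/.htpasswd
    Require valid-user

Unlike the robots.txt tricks, this protects the pages themselves rather than just hiding their URLs.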
Good question, and for most sites it's probably not needed. That being said, if you have a website with sensitive data, OR you have a page or directory that you want to make sure a search engine like Google never indexes, some people like to put a disallow on that folder or page.

For instance, perhaps your site has a folder called http://website.com/userdata/. Normally you need a password to get into that folder, but if you update your code, or you have a code bug, it might be possible for a robot/search engine to get in there, spider the entire folder, and now it's all on Google. So to protect against this "edge case", the webmaster might put Disallow: /userdata into his robots.txt file just to ensure Google never crawls it.

Well, now guess what: you've just told hackers that you have a subfolder called userdata on your site, and they know that if they attack your site, they might be able to get into that folder and grab some user data. So this might be one reason to hide the robots.txt file, i.e. you don't want to give anything away.

A better option would be the meta noindex tag; however, that's a code update, and some people have their hands tied there, especially if the data in the "sensitive" folder is a blob of data (.pdf??) rather than a web page.
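For that last case, one standard workaround (not raised in this thread) is the X-Robots-Tag HTTP header, which applies noindex to files like PDFs that can't carry a meta tag. A sketch, assuming Apache with mod_headers enabled and the example /userdata/ folder from the post above:

    # In the folder's .htaccess (requires mod_headers).
    <FilesMatch "\.pdf$">
      # Sends noindex in the response headers, doing for PDFs what the
      # meta robots tag does for HTML pages.
      Header set X-Robots-Tag "noindex"
    </FilesMatch>

As with the meta tag, the crawler has to be able to fetch the file to see the header, so the folder must not also be disallowed in robots.txt.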