I'm looking into some SEO issues for my home business site, and I keep seeing that Google expects to find a robots.txt file. For SEO? But then I read that lots of robots just ignore it anyway, and of course we WANT Google to keep the site indexed, right? So why exactly do I need a robots.txt, and what should I put in it?
You're right that some bots will ignore your robots.txt. If you want Google to crawl your entire website, put this into your robots.txt:

    User-agent: *
    Disallow:

The truth is your website will get crawled without a robots.txt. It's mostly used to stop Google (and other bots) from crawling certain parts of a website.
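One detail worth knowing (example.com below is just a placeholder for your own domain): crawlers only look for the file in one fixed spot, the root of the host, so it has to be reachable at https://example.com/robots.txt. A robots.txt sitting in a subdirectory is ignored. The file is plain text, and lines starting with # are comments, so you can annotate it:

    # Served from https://example.com/robots.txt
    # The "#" lines are comments; crawlers skip them.
    User-agent: *
    Disallow:

You can check exactly what the bots see just by opening that URL in your browser.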
I was going without one completely, but I keep seeing that for SEO purposes, Google looks for one. So I just set this one up anyway, to exclude my test directory from being indexed:

    User-agent: *
    Disallow: /test/
If NOTHING in your test directory is EVER linked to the outside world, NO robot can find it or index it, in which case you do not need a robots.txt file for it.

However, be aware that ROGUE bots do NOT honor a robots.txt file; they index the pages anyway. Having a robots.txt entry for a test directory tells the rogue bots exactly where your test directory is located, AND THEY WILL INDEX IT IF THEY FIND YOUR robots.txt file. I know this because it has happened to me. I had to move everything, then create a fake page with fake information for the rogue bots to index, to repair the damage.

Your best bet is to remove your test directory from the robots.txt file before a rogue bot finds it and indexes it. Then make sure that you NEVER link your test directory to the outside world. You can also put noindex and nofollow values in your META tags for the test pages, which will stop all honorable bots from indexing them (see the snippet below). Finally, you can password protect your test directory, which keeps EVERYONE out except those who know the password.
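For reference, here is a minimal sketch of that META tag approach. This goes inside the <head> of each page you want honorable bots to skip:

    <!-- inside the <head> of each test page -->
    <meta name="robots" content="noindex, nofollow">

noindex tells compliant bots not to list the page in search results; nofollow tells them not to follow its links. Rogue bots ignore this just like they ignore robots.txt, which is why a password is the only real lock.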
If I want to disallow several directories in the main domain root, is it done this way?

    User-agent: *
    Disallow: /test/ /JPG/ /PHOT/ /VID/
Not sure what you mean by linked. The test directory is a subdirectory of my main domain, and the main domain is indexed. I'm not familiar with these terms yet. Rogue bots? Are they looking for things to steal? My test directory is just a way to test what I intend to put in the indexed part of the domain later. There's nothing in it that really needs securing, that I know of. And for what purpose would they index it? I have no secret or security info in it. As for password protection, I made a directory like that years ago, eventually forgot the password (and even what was in it), and it still lingers in my ISP's account. I don't even know if they can remove it. I've asked, to no avail.
No, the correct way is one Disallow line per directory:

    User-agent: *
    Disallow: /test/
    Disallow: /JPG/
    Disallow: /PHOT/
    Disallow: /VID/
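One thing to keep in mind (the file names below are made-up examples): each Disallow value is matched as a simple path prefix against the URL, so

    Disallow: /test/

blocks /test/page.html and /test/sub/photo.jpg, but it does NOT block /testing/page.html, because that path does not begin with /test/. The values are also case sensitive, so /JPG/ and /jpg/ are two different rules.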
Is there a link on ANY PUBLIC WEBPAGE ANYWHERE ON THE INTERNET that will open up your test pages? If so, you have outside links.

A rogue bot is a bot that DOES NOT FOLLOW THE RULES: it WILL index pages where you specifically DISALLOW indexing. Yahoo comes to mind here, as Yahoo bots indexed my private pages EVEN THOUGH I DISALLOWED THEM, and unfortunately my private information has spread across the web for anyone to use or abuse.

Indexing bots generally are not looking to steal anything, as they exist for only one purpose: to index pages. The problem lies in the fact that once your private pages are indexed, the WHOLE world can access them, and they can NEVER be unindexed. Should you have any private information on those pages (now or in the future), anyone else can see it or even steal it.

Bots do not need a reason to index. They just index EVERYTHING that they find. If you do not want something indexed, you MUST prevent them from reaching it; disallowing and noindexing are not enough. If bots can never find your pages, you will never need to password protect them to stop the bots.
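Since password protection has come up twice in this thread, here is a minimal sketch of the classic way to do it on an Apache server. Your host's control panel may offer a "password protect directory" button that does the same thing behind the scenes; all paths and names below are made-up examples:

    # /test/.htaccess -- protects everything under /test/
    # Assumes Apache with mod_auth_basic (common on shared hosting).
    AuthType Basic
    AuthName "Private test area"
    # Keep the password file OUTSIDE the web root so it can't be downloaded.
    AuthUserFile /home/youraccount/.htpasswd
    Require valid-user

Create the password file once with the htpasswd tool (yourname is a placeholder):

    htpasswd -c /home/youraccount/.htpasswd yourname

After that, every visitor, human or bot, hits a login prompt before seeing anything in the directory, which is exactly why this stops rogue bots when robots.txt and META tags do not.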