I just put up a new hosted WordPress blog and am having trouble figuring out what my robots.txt file should allow and disallow. Can someone explain this and possibly send me a sample file? The WordPress site was not very helpful. Thanks
I am using this: http://codex.wordpress.org/Search_Engine_Optimization_for_WordPress to create my robots.txt
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /tag
Disallow: /author
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

User-agent: ia_archiver-web.archive.org
Disallow: /

Sitemap: xxxxxx
Now, here is the big question. Does anyone know how to help me create a situation in which, if a visitor navigates to http://example.com/robots.txt, they will see my index page but the URL will remain http://example.com/robots.txt? I believe this can be done with a .htaccess file. Thanks for your help.
Not sure you would even want to do that. The robots.txt file needs to be read by the robots; if you deliver another page in its place, then you defeat the purpose. Is there any reason why you do not want people to see the file in the first place?
You have? What blogs do you know of that redirect robots.txt to the homepage? If this is true, then I would like to investigate that further.
Got the PM, thanks. My best guess is some type of cloaking.

Option 1

Brett Tabke at WebmasterWorld cloaks his robots.txt file: http://www.webmasterworld.com/robots.txt

You will notice that everything is disallowed for everyone. Curious, eh? He redirects his robots.txt file to a Perl script (most probably in .htaccess), the contents of which can be found here: http://www.webmasterworld.com/robots.txt?view=producecode

Inside that script, he allows good bots in to see the actual robots.txt file. All others, including visitors, get the bogus robots.txt file, which could just as easily be a homepage if you wish (not recommended for bad bots; they should get the disallow file). The Perl script can be adapted to PHP with a little research.

Option 2

Use .htaccess to only allow certain user-agents to see the actual robots.txt file. In theory (I am no mod_rewrite expert) this is the basic approach, and it is far from tested; I am just throwing this out there:

RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_URI} ^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} ^MSNbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^AnotherNiceBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^AnotherNiceBotEtc
RewriteRule . /robots2.txt [L]

This should, maybe, I dunno, probably work. It should rewrite requests from all good bots to the real robots file, "robots2.txt". Nobody else should be able to see it, except bot spoofers (which is another topic altogether).
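If anyone wants to try the Option 1 idea in PHP, here is a minimal sketch of how it might look. This is not Brett's actual script; the file names (robots.php, robots-real.txt) and the good-bot list are placeholders, and it assumes an .htaccess rule routes robots.txt requests to the script.

<?php
// robots.php -- hypothetical sketch of the Option 1 cloaking idea.
// Assumes .htaccess routes robots.txt here, e.g.:
//   RewriteRule ^robots\.txt$ /robots.php [L]
header('Content-Type: text/plain');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Illustrative "good bot" list; trim or extend to taste.
$goodBots = array('Googlebot', 'Slurp', 'msnbot');

$isGood = false;
foreach ($goodBots as $bot) {
    if (stripos($ua, $bot) !== false) { // case-insensitive substring match
        $isGood = true;
        break;
    }
}

if ($isGood) {
    // Serve the real rules (robots-real.txt is a placeholder name).
    readfile('robots-real.txt');
} else {
    // Everyone else, visitors included, gets the disallow-all file.
    echo "User-agent: *\nDisallow: /\n";
}
?>

Keep in mind that user-agents are trivially spoofed, so this only keeps out the polite-but-unwanted crowd.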
That is up to you. Personally, I would not bother with it that much. I would be more apt to block bad robots outright, though, especially if the blog gets tied into social websites. When you start Digging, Tweeting, FB Liking, etc., a whole rash of bots comes out of the woodwork. Some good, mostly bad IMO, and they are not well behaved.
A somewhat dated article, but still very useful, is AskApache's Blocking Bad Bots and Scrapers with .htaccess. Note that you probably do not need the entire list in your .htaccess file, and using all of it is probably not advisable, since most of those bots no longer exist; at least, I have not seen them in my logs. The bots that really piss me off are the ones that show up right after a blog tweets (or retweets) a link. Most of them seem to come from Amazon's cloud computing services (not Amazon per se, but people using their service), and that seems to be a trend. These bots run open-source or free scripts and are rather badly behaved: sometimes there is no user-agent, or it is spoofed; after the tweet they randomly rifle through several pages, come back and hit the same page again, etc., yada, yada, yada.
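As a starting point, here is a trimmed-down sketch of the kind of .htaccess blocking that article describes. The user-agent strings below are only examples; build your own list from what actually shows up in your logs.

# Flag known bad user-agents (example strings only)
SetEnvIfNoCase User-Agent "libwww-perl" bad_bot
SetEnvIfNoCase User-Agent "HTTrack" bad_bot
SetEnvIfNoCase User-Agent "WebCopier" bad_bot

# Deny any request flagged above
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>

It will not catch the ones with no user-agent or a spoofed one, but it cuts down the noise from the named offenders.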