Help With WordPress robots.txt file

Discussion in 'WordPress' started by savantcreative, Jan 2, 2011.

  1. #1
    I just put up a new hosted WordPress blog and am having trouble figuring out what my robots.txt file should allow and disallow. Can someone explain this and possibly send me a sample file? The WordPress site was not very helpful. Thanks.
     
    savantcreative, Jan 2, 2011 IP
  2. Sweely

    #2
    Sweely, Jan 3, 2011 IP
  3. KimiGermany

    #3
    KimiGermany, Jan 3, 2011 IP
  4. mtwzh

    #4
    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content
    Disallow: /tag
    Disallow: /author
    Disallow: /wget/
    Disallow: /httpd/
    Disallow: /i/
    Disallow: /f/
    Disallow: /t/
    Disallow: /c/
    Disallow: /j/

    User-agent: Mediapartners-Google
    Allow: /

    User-agent: Adsbot-Google
    Allow: /

    User-agent: Googlebot-Image
    Allow: /

    User-agent: Googlebot-Mobile
    Allow: /

    User-agent: ia_archiver-web.archive.org
    Disallow: /

    Sitemap: xxxxxx
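
    A quick way to sanity-check a file like the one above is Python's standard urllib.robotparser module. The sketch below is only an illustration: it copies a subset of the rules above, and the bot name SomeBot, the host example.com and the sample paths are made up. The standard-library parser uses simple first-match prefix rules, so real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules posted above (illustrative copy, not the full file).
RULES = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /tag
Disallow: /author

User-agent: Googlebot-Image
Allow: /
"""


def can_fetch(agent: str, path: str) -> bool:
    """Return True if `agent` may crawl `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for agent, path in [
        ("SomeBot", "/wp-admin/options.php"),
        ("SomeBot", "/2011/01/hello-world/"),
        ("Googlebot-Image", "/wp-content/uploads/photo.jpg"),
    ]:
        print(agent, path, can_fetch(agent, path))
```

    Because Googlebot-Image has its own group, it is not bound by the "User-agent: *" rules, so it may still fetch images under /wp-content.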
     
    mtwzh, Jan 3, 2011 IP
  5. savantcreative

    #5
    savantcreative, Jan 4, 2011 IP
  6. savantcreative

    #6
    savantcreative, Jan 4, 2011 IP
  7. SEOaaron

    #7
    WordPress automatically creates one on the fly for you.
     
    SEOaaron, Jan 7, 2011 IP
  8. savantcreative

    #8
    No. Please reread my post to see what I mean. I am talking about a redirect and a rewrite. Thanks
     
    savantcreative, Jan 8, 2011 IP
  9. Dodger

    #9
    Not sure you would even want to do that. The robots.txt file needs to be read by the robots; if you deliver another page in its place, you defeat the purpose.

    Is there any reason why you do not want people to see the file in the first place?
     
    Dodger, Jan 8, 2011 IP
  10. savantcreative

    #10
    I am curious because I see a bunch of blogs doing this.
     
    savantcreative, Jan 9, 2011 IP
  11. Dodger

    #11
    You have? What blogs do you know of that redirect robots.txt to the homepage? If this is true, then I would like to investigate that further.
     
    Dodger, Jan 9, 2011 IP
  12. savantcreative

    #12
    I just sent you a PM. Please let me know what you think. Thanks
     
    savantcreative, Jan 10, 2011 IP
  13. Dodger

    #13
    Got the PM, thanks.

    My best guess is some type of cloaking.

    Option 1
    Brett Tabke at WebmasterWorld cloaks his robots.txt file:

    http://www.webmasterworld.com/robots.txt

    You will notice that everything is disallowed for everyone. Curious, eh?

    He redirects his robots.txt file to a Perl script (most probably in .htaccess), the contents of which can be found here:

    http://www.webmasterworld.com/robots.txt?view=producecode

    Inside that script, he allows good bots in to see the actual robots.txt file. All others, including visitors, get the bogus robots.txt file, which could just as easily be the homepage if you wish (not recommended for bad bots; they should get the disallow file).

    The Perl script can be easily adapted to PHP with a little research.
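
    The decision logic behind Option 1 can be sketched in a few lines. This is not Brett Tabke's script, only an illustration written in Python; the bot list, both rule sets and the function name robots_body are assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical whitelist of crawler names, matched anywhere in the
# User-Agent header, case-insensitively.
GOOD_BOTS = re.compile(r"googlebot|slurp|msnbot", re.IGNORECASE)

# Placeholder rule sets; substitute your own files.
REAL_RULES = "User-agent: *\nDisallow: /wp-admin\n"
DECOY_RULES = "User-agent: *\nDisallow: /\n"


def robots_body(user_agent: Optional[str]) -> str:
    """Pick the robots.txt body to serve for a given User-Agent header."""
    if user_agent and GOOD_BOTS.search(user_agent):
        return REAL_RULES
    # Everyone else, including clients that send no User-Agent at all,
    # gets the disallow-everything file.
    return DECOY_RULES


if __name__ == "__main__":
    googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(robots_body(googlebot))
    print(robots_body(None))
```

    A User-Agent header is trivial to fake, so a check like this only filters out clients that do not bother to spoof one.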

    Option 2
    Use .htaccess to allow only certain user-agents to see the actual robots.txt file.

    In theory (I am no mod_rewrite expert) this is the basic approach, and it is far from tested. I am just throwing this out there.

    
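    RewriteEngine On
    RewriteBase /
    # Only act on requests for /robots.txt (dot escaped so it matches literally)
    RewriteCond %{REQUEST_URI} ^/robots\.txt$
    # Match crawler names anywhere in the User-Agent, case-insensitively;
    # most bots send strings like "Mozilla/5.0 (compatible; Googlebot/2.1; ...)"
    RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Slurp [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} msnbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} AnotherNiceBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} AnotherNiceBotEtc [NC]
    # Internally serve the real rules file to the bots matched above
    RewriteRule . /robots2.txt [L]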
    
    This should, maybe, I dunno, probably work. It should send all good bots to the real robots file, "robots2.txt". Nobody else should be able to see it, except bot spoofers (which is another topic altogether).
     
    Dodger, Jan 10, 2011 IP
  14. savantcreative

    #14
    Do you think it is worth getting involved with? Thanks.
     
    savantcreative, Jan 11, 2011 IP
  15. Dodger

    #15
    That is up to you. Personally, I would not bother with it that much.

    I would be more apt to block bad robots outright, though, especially if the blog gets tied into social websites. When you start Digging, Tweeting, FB Liking, etc., a whole rash of bots comes out of the woodwork. Some are good, but most are bad IMO and are not well behaved.
     
    Dodger, Jan 11, 2011 IP
  16. savantcreative

    #16
    Thanks. How can I find out how to keep the bad bots out?
     
    savantcreative, Jan 12, 2011 IP
  17. Dodger

    #17
    A somewhat dated but still very useful article is AskApache's Blocking Bad Bots and Scrapers with .htaccess. Please note that you may not need the entire list in your .htaccess file, and using all of it is probably not advisable, since most of these bots do not exist anymore; at least, I have not seen them in my logs.

    The bots that really piss me off are the ones that show up right after a blog tweets (or retweets) a link; then they all come out of the woodwork. Most of those bots seem to come from Amazon cloud computing services (not Amazon per se, but people using their service), and that seems to be a trend. These bots use open source or free scripts and are rather badly behaved: sometimes there is no User-Agent or it is spoofed, and after the tweet they randomly rifle through several pages, come back and hit the same page again, etc. Yada, yada, yada.
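
    To spot that kind of behavior in your own logs, a small script can help. The sketch below is a rough illustration that assumes Apache's combined log format; the function name suspicious_clients, the max_repeats threshold and the sample log lines are all made up for the example.

```python
import re
from collections import Counter

# Combined log format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def suspicious_clients(lines, max_repeats=3):
    """Return IPs that send no User-Agent or hit one path more than max_repeats times."""
    hits = Counter()
    flagged = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip, path, agent = match["ip"], match["path"], match["agent"]
        if agent in ("", "-"):
            flagged.add(ip)  # missing User-Agent
        hits[(ip, path)] += 1
        if hits[(ip, path)] > max_repeats:
            flagged.add(ip)  # same page requested over and over
    return flagged


# Made-up sample lines using documentation IP ranges.
SAMPLE = [
    '203.0.113.5 - - [13/Jan/2011:10:00:00 +0000] "GET /post-1/ HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.7 - - [13/Jan/2011:10:00:01 +0000] "GET /post-1/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(sorted(suspicious_clients(SAMPLE)))
```

    A spoofed User-Agent will slip past the first check, so the repeat counter is the more useful signal for those bots.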
     
    Dodger, Jan 13, 2011 IP
  18. savantcreative

    #18
    Thanks, Dodger. I will read this today.
     
    savantcreative, Jan 14, 2011 IP