Googlebot not following the rules

Discussion in 'robots.txt' started by C. Szeler, Jun 25, 2006.

  1. #1
    Over the past month, I've noticed Googlebot constantly crawling pages that are disallowed in my robots.txt. The IPs check out, so it's not an imposter bot, but it's totally hammering some phpBB pages I've disallowed, such as login.php and posting.php.

    Here's what it looks like (notice hits are anywhere from 1-3 seconds apart).

    66.249.66.4 - - [25/Jun/2006:07:45:23 -0400] "GET /forum/login.php?redirect=posting.php&mode=reply&t=5462&sid=50db820c1f8c72c005e0b3c26e41074e HTTP/1.1" 200 3091 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    66.249.66.4 - - [25/Jun/2006:07:45:24 -0400] "GET /forum/posting.php?mode=quote&p=225484&sid=1cd6df44f909d3ad262197073e09be46 HTTP/1.1" 302 26 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


    Does anybody have a clue about how to get Google the heck off these pages?

    Robots.txt snip:
    Disallow: /forums/posting.php
    Disallow: /forums/login.php
     
    C. Szeler, Jun 25, 2006 IP
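
When comparing log entries like the ones above against robots.txt rules, it can help to pull the requested path out of each line first. The following is a minimal sketch, assuming the server writes the common "combined" log format; the names `LOG_PATTERN`, `parse_log_line`, and `request_path` are made up for illustration and do not come from the thread.

```python
import re

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def request_path(target):
    """Drop the query string, leaving only the path part of the request target."""
    return target.split("?", 1)[0]


# Second log line from post #1, copied as posted.
sample = (
    '66.249.66.4 - - [25/Jun/2006:07:45:24 -0400] '
    '"GET /forum/posting.php?mode=quote&p=225484&sid=1cd6df44f909d3ad262197073e09be46 HTTP/1.1" '
    '302 26 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

fields = parse_log_line(sample)
print(request_path(fields["target"]), fields["status"])
```

The extracted path is the string that has to begin with a `Disallow:` value for that rule to apply, so it is the part worth comparing character by character.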
  2. mvandemar

    mvandemar Notable Member

    #2
    That's because the pages getting hit are not the same as the pages in your robots.txt file according to Google. The best answer I can see would be to move those pages to their own subdirectories, and just disallow everything from there. Of course, it means redoing all your linking, which is a major pita depending on what software you are using.

    -Michael
     
    mvandemar, Jun 25, 2006 IP
  3. vagrant

    vagrant Peon

    #3
    GET /forum/.....

    You disallowed /forums/, but the log shows /forum/. Are you sure you have the correct location?

    Also, since Google accepts a * wildcard, you could try:
    Disallow: /forums/posting.ph*
     
    vagrant, Jun 25, 2006 IP
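
The /forum/ versus /forums/ mismatch can be checked mechanically. Here is a small sketch using Python's standard `urllib.robotparser`, which applies plain prefix matching (it does not implement Google's `*` wildcard extension). The `User-agent: *` line is an assumption, since the thread only shows the `Disallow` lines, and `is_blocked` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# "User-agent: *" is an assumption; the thread only shows the Disallow lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forums/posting.php
Disallow: /forums/login.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(path):
    """True if the parsed rules forbid Googlebot from fetching this path."""
    return not parser.can_fetch("Googlebot", path)


# Path as it appears in the posted log (/forum/).
print(is_blocked("/forum/posting.php?mode=quote&p=225484"))
# Same request under the directory named in the posted robots.txt (/forums/).
print(is_blocked("/forums/posting.php?mode=quote&p=225484"))
```

Under prefix matching, a rule that starts with /forums/ cannot cover a path that starts with /forum/, while a query string after posting.php does not stop the rule from applying.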
  4. Matts

    Matts Berserker

    #4
    How about adding two more lines with wildcards:
    Disallow: /forums/posting.php
    Disallow: /forums/login.php
    Disallow: /forums/posting.php*
    Disallow: /forums/login.php*

    Google supposedly will understand the wildcard match. Can't hurt to try it.
     
    Matts, Jun 25, 2006 IP
  5. mvandemar

    mvandemar Notable Member

    #5
    Good catch, missed that. :)

    -Michael
     
    mvandemar, Jun 25, 2006 IP
  6. tonyinabox

    tonyinabox Peon

    #6
    Because the bot thinks they are different pages (with and without the query string).
     
    tonyinabox, Jun 25, 2006 IP
  7. C. Szeler

    C. Szeler Member

    #7
    Sorry for not being clear. I just edited the subdirectory to provide a clear example. In fact it is *forums*. The directories/pages in question are a little bit deeper, so for the sake of the example, there were no misspellings.

    So do you figure
    Disallow: /forums/posting.php
    Disallow: /forums/login.php
    Disallow: /forums/posting.php*
    Disallow: /forums/login.php*
    will do the trick?
     
    C. Szeler, Jun 25, 2006 IP
  8. Jean-Luc

    Jean-Luc Peon

    #8
    Hi,

    Do you mean that you edited the two lines from your log file (forums replaced by forum)?
    Regarding your last question, the answer is "Definitely not!". Adding the "*" at the end will not improve anything.

    Jean-Luc

    P.S. Could you give us the URL of your site?
     
    Jean-Luc, Jun 25, 2006 IP
  9. fsmedia

    fsmedia Prominent Member

    #9
    You really need to use

    Disallow: /forum/posting.php

     
    fsmedia, Jun 25, 2006 IP
  10. C. Szeler

    C. Szeler Member

    #10
    Let me try to clarify again. /forums/ and /forum/ do not exist. /forums/ is used purely as an example, and /forum/ was a misspelling.


    The main point is: How do I keep Google off of these pages?
    
    Disallow: /forums/posting.php
    Disallow: /forums/login.php
    
    Is not working! I have since made it look like this:
    
    Disallow: /forums/posting.php
    Disallow: /forums/login.php
    Disallow: /forums/posting.php*
    Disallow: /forums/login.php*
    
    I'm not about to email Google though, and complain they are indexing my site.
     
    C. Szeler, Jun 25, 2006 IP
  11. mvandemar

    mvandemar Notable Member

    #11
    OK, what did you mean by forum and forums do not exist? Was the example log that you posted real? If so (because it really would be overkill to fake that much detail), then the pages you listed would not be restricted by the example robots.txt that you posted here. Is that what your actual robots.txt looks like?

    -Michael
     
    mvandemar, Jun 25, 2006 IP
  12. lorien1973

    lorien1973 Notable Member

    #12
    Don't you need to disallow the ?'s?

    Disallow: /forums/posting.php?
    Disallow: /forums/login.php?

    See Google's own robots.txt file. That's what they do.

     
    lorien1973, Jun 25, 2006 IP
  13. Jean-Luc

    Jean-Luc Peon

    #13
    I confirm my previous post: adding the "*" at the end will not improve anything.

    If you gave us the URL of your site, we could probably help more effectively.

    Jean-Luc
     
    Jean-Luc, Jun 25, 2006 IP
  14. C. Szeler

    C. Szeler Member

    #14
    The URL of the site is pretty irrelevant. I'm trying to illustrate that Google is ignoring the robots.txt file, and no, I just edited the "forum" and "forums" part of both quotes to simplify the situation.

    So do you gather that adding a ? as opposed to a * would make a difference? That seems illogical.

    Edit: Okay, here's a snippet from the new robots.txt:

    
    Disallow: /forums/posting.php
    Disallow: /forums/posting.php*
    Disallow: /forums/posting.php?
    Disallow: /forums/login.php*
    Disallow: /forums/login.php
    Disallow: /forums/login.php?
    
     
    C. Szeler, Jun 25, 2006 IP
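
To see how the plain, `*`, and `?` variants of a rule relate, here is a rough sketch of wildcard-style matching as Google documents it: a rule is a prefix match, `*` stands for any run of characters, and a trailing `$` anchors the end of the URL. This is an illustration under those assumptions, not Googlebot's actual code, and `rule_to_regex` and `rule_matches` are invented names.

```python
import re


def rule_to_regex(rule):
    """Translate a Disallow value into a regex.

    '*' matches any run of characters, a trailing '$' anchors the end,
    and everything else is matched literally from the start of the target.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule, target):
    """True if the rule covers the request target (path plus query string)."""
    return rule_to_regex(rule).match(target) is not None


targets = [
    "/forums/posting.php",
    "/forums/posting.php?mode=quote&p=225484",
    "/forums/login.php?redirect=posting.php",
]

for rule in ("/forums/posting.php", "/forums/posting.php*", "/forums/posting.php?"):
    print(rule, [rule_matches(rule, t) for t in targets])
```

In this model the plain rule and the trailing-`*` rule cover exactly the same URLs, which is the point made in posts #8 and #13. The `?` variant is narrower, since it only covers requests that carry a query string, so it adds nothing when the plain rule is already present.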
  15. markhutch

    markhutch Peon

    #15
    Of all the search engines, I've found Google to be the best at following robots.txt instructions, provided the robots.txt is constructed correctly. It's hard to understand what the problem might be without something real to look at. Claiming that Googlebot isn't following robots.txt without giving an example website is going to be hard for some here to believe. Of course, if the robots.txt file is new, then Google may not be following it yet. I've seen it take two weeks before the crawler caught up with new robots.txt instructions.
     
    markhutch, Jun 25, 2006 IP
  16. mvandemar

    mvandemar Notable Member

    #16
    The URL would allow us to see for ourselves what your structure and robots.txt looked like and if they matched.

    Is what you posted your ACTUAL robots.txt (because it sounds like it is)? And is what you posted in post #1 from your ACTUAL log files (because it looks like a real log file)?

    Because, if so, then there is no reason to believe that Googlebot is ignoring anything.

    -Michael
     
    mvandemar, Jun 27, 2006 IP