Over the past month, I've noticed Googlebot obsessively crawling pages that are disallowed in my robots.txt. The IPs check out, so it's not an imposter bot... but it's totally hammering some phpBB pages I've disallowed, such as login.php and posting.php. Here's what it looks like (notice the hits are anywhere from 1-3 seconds apart):
66.249.66.4 - - [25/Jun/2006:07:45:23 -0400] "GET /forum/login.php?redirect=posting.php&mode=reply&t=5462&sid=50db820c1f8c72c005e0b3c26e41074e HTTP/1.1" 200 3091 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.4 - - [25/Jun/2006:07:45:24 -0400] "GET /forum/posting.php?mode=quote&p=225484&sid=1cd6df44f909d3ad262197073e09be46 HTTP/1.1" 302 26 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Does anybody have a clue about how to get Google the heck off these pages? Robots.txt snip:
Disallow: /forums/posting.php
Disallow: /forums/login.php
That's because the pages getting hit are not the same as the pages in your robots.txt file according to Google. The best answer I can see would be to move those pages to their own subdirectories, and just disallow everything from there. Of course, it means redoing all your linking, which is a major pita depending on what software you are using. -Michael
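A quick way to check this point is to feed the posted rules to a robots.txt parser and ask about the logged paths. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `User-agent: *` line and the helper name `is_allowed` are assumptions added for illustration, since the original snip shows only the two Disallow lines. Python's parser does plain prefix matching, which is enough to show the mismatch between /forum/ and /forums/.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the first post. The "User-agent: *" line is an
# assumption; the posted snip did not include a User-agent line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forums/posting.php
Disallow: /forums/login.php
"""


def is_allowed(path, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Paths as logged (/forum/) next to the same paths under /forums/.
for path in (
    "/forum/login.php?redirect=posting.php&mode=reply&t=5462",
    "/forum/posting.php?mode=quote&p=225484",
    "/forums/login.php?redirect=posting.php&mode=reply&t=5462",
    "/forums/posting.php?mode=quote&p=225484",
):
    print(path, "->", "allowed" if is_allowed(path) else "blocked")
```

If the log and the robots.txt really look as posted, the /forum/ requests are simply not covered by the /forums/ rules, so a crawler that obeys robots.txt is still free to fetch them.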
The log shows GET /forum/... but you disallowed /forums/. Are you sure you have the correct location? Also, since Google will accept a *, you could try: Disallow: /forums/posting.ph*
How about adding two more lines with wildcards:
Disallow: /forums/posting.php
Disallow: /forums/login.php
Disallow: /forums/posting.php*
Disallow: /forums/login.php*
Google supposedly will understand the wildcard match. Can't hurt to try it.
Sorry for not being clear. I just edited the subdirectory name to provide a clear example. In fact it is *forums*; the directories/pages in question are a little bit deeper, so for the sake of the example, there were no misspellings. So do you figure
Disallow: /forums/posting.php
Disallow: /forums/login.php
Disallow: /forums/posting.php*
Disallow: /forums/login.php*
will do the trick?
Hi, Do you mean that you edited the two lines from your log file (forums replaced by forum)? Regarding your last question, the answer is "Definitely not!". Adding the "*" at the end will not improve anything. Jean-Luc P.S. Could you give us the URL of your site?
Let me try to clarify again. /forums/ and /forum/ do not exist. /forums/ is used purely as an example and /forum/ was a misspelling. The main point is: how do I keep Google off of these pages? This:
Disallow: /forums/posting.php
Disallow: /forums/login.php
is not working! I have since made it look like this:
Disallow: /forums/posting.php
Disallow: /forums/login.php
Disallow: /forums/posting.php*
Disallow: /forums/login.php*
I'm not about to email Google though, and complain they are indexing my site.
Ok, what did you mean by forum and forums do not exist? Was the example log that you posted real? If so (because it really was an overkill to fake that much detail), then the pages you listed would not be restricted by the example robots.txt that you posted here. Is that what your actual robots.txt looks like...? -Michael
Don't you need to disallow the ?'s?
Disallow: /forums/posting.php?
Disallow: /forums/login.php?
See Google's own robots.txt file. That's what they do.
I confirm my previous post: adding the "*" at the end will not improve anything. If you would give us the URL of your site, we could probably help more effectively. Jean-Luc
The URL of the site is pretty irrelevant. I'm trying to illustrate that Google is ignoring the robots.txt file, and no, I just edited the "forum" and "forums" part of both quotes to simplify the situation. So do you gather that adding a ? as opposed to a * would make a difference? That seems illogical. Edit: Okay, here's a snippet from the new robots.txt:
Disallow: /forums/posting.php
Disallow: /forums/posting.php*
Disallow: /forums/posting.php?
Disallow: /forums/login.php*
Disallow: /forums/login.php
Disallow: /forums/login.php?
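To see why the extra "*" and "?" lines add nothing, it helps to model how a wildcard-aware crawler matches a Disallow rule against a path. The sketch below is a simplified approximation, not Google's actual implementation: it assumes a rule matches from the start of the path, that "*" stands for any run of characters, and that a trailing "$" anchors the end. The function names `rule_to_regex` and `is_blocked` are hypothetical.

```python
import re


def rule_to_regex(rule):
    """Compile a Disallow pattern into a regex (simplified model).

    '*' matches any run of characters, a trailing '$' anchors the end
    of the path, and every other character is taken literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule, path):
    """Return True if `rule` matches `path` starting at its first character."""
    return rule_to_regex(rule).match(path) is not None


rules = (
    "/forums/posting.php",
    "/forums/posting.php*",
    "/forums/posting.php?",
)
paths = (
    "/forums/posting.php",
    "/forums/posting.php?mode=quote&p=225484",
    "/forum/posting.php?mode=quote&p=225484",
)
for rule in rules:
    for path in paths:
        print(f"{rule!r:26} {path!r:44} blocked={is_blocked(rule, path)}")
```

Under these assumptions the plain rule and the "*" variant block exactly the same paths, because a rule is already a prefix match. The "?" variant is strictly narrower, since it no longer covers the bare /forums/posting.php. None of the three touches a path that starts with /forum/.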
Of all the SEs, I've found Google to be the best at following robots.txt instructions, provided the robots.txt is constructed correctly. It's hard to understand what the problem might be without something real to look at. If you claim Googlebot isn't following robots.txt without giving an example website, some here are going to find it hard to believe you. Of course, if the robots.txt is new, then Google may not be following it yet. I've seen it take two weeks before the crawler caught up with new robots.txt instructions.
The URL would allow us to see for ourselves what your structure and robots.txt look like and whether they match. Is what you posted your ACTUAL robots.txt (because it sounds like it is)...? And is what you posted in post #1 your ACTUAL log file (because it looks like a real log file)? Because, if so, then there is no reason to believe that Googlebot is ignoring anything. -Michael