Google is indexing URLs that I have blocked using robots.txt. If I use Disallow: /page.php, should that stop page.php?id=123 as well? It's not working. Google Sitemaps says "URLs restricted by robots.txt: 173", and yet all 173 of those pages are indexed. Anyone else noticing this? From Google
If they were indexed at some point when they weren't mentioned in robots.txt, it takes a long time for them to be removed, and they resurface occasionally. Use Google's removal form to request removal of the page. You may also want to try Disallow: /page.php? (which Google uses in its own robots.txt) or even Disallow: /page.php?*
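As a sanity check on the matching question, here is a small sketch using Python's standard-library urllib.robotparser, which implements the classic prefix-matching robots.txt rules. The example.com host and the id=123 URL are placeholders, not from the thread:

```python
import urllib.robotparser

# Classic robots.txt matching is prefix-based: "Disallow: /page.php"
# covers every URL whose path starts with /page.php, including ones
# that carry a query string such as /page.php?id=123.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /page.php",
])

print(rp.can_fetch("Googlebot", "http://example.com/page.php"))         # False
print(rp.can_fetch("Googlebot", "http://example.com/page.php?id=123"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/other.html"))       # True
```

So under plain prefix matching, Disallow: /page.php should already cover the ?id=123 variants; Googlebot also understands * wildcards as an extension, which is why the /page.php?* form works there too. The original problem is a crawl-vs-index distinction, not a pattern mismatch.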
Update: I uploaded my existing robots.txt file to http://services.google.com:8882/urlconsole/controller Will see what happens.
That actually is a pretty old issue with Google. Entire folders excluded by robots.txt still show up, at least as URLs, when searching Google; for example, I have had cgi-bin and others excluded for MANY years. The URL may exist in Google's index forever, but the page itself will most likely never appear in Google's cache. To remove URLs from the index you have to use Google's URL removal procedure, BUT that procedure requires the request to be served a 404! Other excluded URLs reappear again and again as long as other sites on the web still link to them. Google's position on this is strict. I recently had a mail exchange with Google about NON-existing folders: they refused to remove the tens of thousands of URLs that return a 404, because half a dozen backlinks to them still exist out there. The removal procedure in such cases brings only a temporary solution, and a few months later the non-existing folder reappears (it did so this year), for as long as a backlink to that non-existing file or folder exists.
Having done some research, this is an old issue; it just never happened to me before. To summarise: if you block Google's access to certain URLs, it will not crawl them. This apparently does not mean they won't be indexed. Google will index them with as much information as it knows. In most cases this is just the URL, but it can also be a DMOZ title if it's the site's main page. Seems pretty stupid to me.
I agree mad4, it's stupid for site owners, but it is a nice GIFT TO hackers from Google. All they often need is to find the URLs of certain tools or installed software, and Google happily offers such info to any hacker. (Earlier this year I had repeated hacker intrusions as a direct result of exactly such robots.txt-excluded URLs turning up in a Google search.) The more URLs a search engine has, the higher its stock value, I guess, and that might be all that matters to Google as well ... $
OK, here is an example: http://www.google.co.uk/search?hl=e...gitalpoint.com/newreply.php&btnG=Search&meta= http://forums.digitalpoint.com/robots.txt contains Disallow: /newreply.php. The anchor text "view the article on its blog" comes from http://www.cocomment.com/blog/34675, which is the only page linking to http://forums.digitalpoint.com/newreply.php
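For what it's worth, you can confirm that the quoted rule really does block crawling of that URL, which supports the idea that the listing comes purely from the cocomment backlink and not from a fetch. A sketch, again with the stdlib parser; only the Disallow line is taken from the thread, the index.php URL is a made-up control case:

```python
import urllib.robotparser

def blocked(robots_lines, agent, url):
    """True if the given robots.txt lines disallow `url` for `agent`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return not rp.can_fetch(agent, url)

# The rule quoted above from forums.digitalpoint.com/robots.txt
rules = ["User-agent: *", "Disallow: /newreply.php"]

print(blocked(rules, "Googlebot", "http://forums.digitalpoint.com/newreply.php"))  # True
```

The rule holds, so Googlebot never fetched the page; the search result is built from nothing but the URL and the anchor text of the one external link.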