Google is indexing URLs that I have blocked using robots.txt. If I use Disallow: /page.php, should that stop page.php?id=123 as well? It's not working. Google Sitemaps says "URLs restricted by robots.txt: 173", and yet all 173 of those pages are indexed. Anyone else noticing this? From Google
If they were indexed at some point when they weren't mentioned in robots.txt, it takes a long time for them to be removed, and they resurface occasionally. Use Google's removal form to request removal of the page. You may also want to try Disallow: /page.php? (which Google uses in its own robots.txt) or even Disallow: /page.php?*
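As a sanity check on the matching question, here is a small sketch using Python's standard-library urllib.robotparser, which implements the classic prefix-matching robots.txt rules. The example.com host and the id=123 URL are placeholders, not from the thread:

```python
import urllib.robotparser

# Classic robots.txt matching is prefix-based: "Disallow: /page.php"
# covers every URL whose path starts with /page.php, including ones
# that carry a query string such as /page.php?id=123.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /page.php",
])

print(rp.can_fetch("Googlebot", "http://example.com/page.php"))         # False
print(rp.can_fetch("Googlebot", "http://example.com/page.php?id=123"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/other.html"))       # True
```

So under plain prefix matching, Disallow: /page.php should already cover the ?id=123 variants; Googlebot also understands * wildcards as an extension, which is why the /page.php?* form works there too. The original problem is a crawl-vs-index distinction, not a pattern mismatch.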
Update: I uploaded my existing robots.txt file to http://services.google.com:8882/urlconsole/controller Will see what happens.
That actually is a pretty old issue with Google. Entire folders excluded by robots.txt still show up, at least as URLs, when searching Google; for example, I have had cgi-bin and others excluded for MANY years. The URL may exist in Google's index forever, but the page itself will most likely never appear in Google's cache. To remove URLs from the index you have to use Google's URL removal procedure, BUT that procedure requires the request to be served a 404! Other excluded URLs reappear again and again as long as other sites on the web still link to them. Google's position on this is strict. I recently had a mail exchange with Google about NON-existing folders: they refused to remove the tens of thousands of URLs that return a 404, because half a dozen backlinks to them still exist out there. The removal procedure in such cases brings only a temporary solution, and a few months later the non-existing folder reappears (it did so this year), for as long as a backlink to that non-existing file or folder exists.
Having done some research, this is an old issue; it just never happened to me before. To summarise: if you block Google's access to certain URLs, it will not crawl them. This apparently does not mean they won't be indexed. Google will index them with as much information as it knows. In most cases this is just the URL, but it can also be a DMOZ title if it's the site's main page. Seems pretty stupid to me.
I agree mad4, it's stupid for site owners, but it is a nice GIFT TO hackers from Google. All they often need is to find the URLs of certain tools or installed software, and Google happily offers such info to any hacker. (Earlier this year I had repeated hacker intrusions as a direct result of exactly such robots.txt-excluded URLs turning up in a Google search.) The more URLs a search engine has, the higher its stock value, I guess, and that might be all that matters to Google as well ... $
OK, here is an example: http://www.google.co.uk/search?hl=e...gitalpoint.com/newreply.php&btnG=Search&meta= http://forums.digitalpoint.com/robots.txt contains Disallow: /newreply.php. The anchor text "view the article on its blog" comes from http://www.cocomment.com/blog/34675, which is the only page linking to http://forums.digitalpoint.com/newreply.php
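For what it's worth, you can confirm that the quoted rule really does block crawling of that URL, which supports the idea that the listing comes purely from the cocomment backlink and not from a fetch. A sketch, again with the stdlib parser; only the Disallow line is taken from the thread, the index.php URL is a made-up control case:

```python
import urllib.robotparser

def blocked(robots_lines, agent, url):
    """True if the given robots.txt lines disallow `url` for `agent`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return not rp.can_fetch(agent, url)

# The rule quoted above from forums.digitalpoint.com/robots.txt
rules = ["User-agent: *", "Disallow: /newreply.php"]

print(blocked(rules, "Googlebot", "http://forums.digitalpoint.com/newreply.php"))  # True
```

The rule holds, so Googlebot never fetched the page; the search result is built from nothing but the URL and the anchor text of the one external link.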