I have a load of pages that are in the supplemental index. Mostly they are duplicates in some way (e.g. printable versions). I want to de-index them so that the ratio of non-supp:supp is better. I have blocked them with robots.txt, but they are still in the index. From what I know there are a few theories:

1- They will be removed from the index automagically because they are blocked in robots.txt.
2- You have to replace the pages you want de-indexed with a blank page. (OK, I can redirect and change the links.)
3- Use the Google Webmaster Tools removal tool. (I don't want to do that because there are hundreds of them.)

Which is right? I'm trying number 1 at the moment, but the count isn't going down - caches of deleted pages can hang around in the index for years, right? Thanks
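For reference, here is roughly what my robots.txt block looks like - a minimal sketch, and the /printable/ path is just a made-up example, not my real URL structure:

# Example only: keep all crawlers away from the duplicate printable versions
User-agent: *
Disallow: /printable/

I believe Googlebot also understands wildcard patterns, so something like Disallow: /*?print=1 should work if the duplicates are only distinguished by a query parameter.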
Hey, this is not something to worry about so much, but the following comment by Matt Cutts can be useful for you:

"supplemental results aren't something to be afraid of; I've got pages from my site in the supplemental results, for example. A complete software rewrite of the infrastructure for supplemental results launched in Summer o' 2005, and the supplemental results continue to get fresher. Having urls in the supplemental results doesn't mean that you have some sort of penalty at all; the main determinant of whether a url is in our main web index or in the supplemental index is PageRank.

If you used to have pages in our main web index and now they're in the supplemental results, a good hypothesis is that we might not be counting links to your pages with the same weight as we have in the past. The approach I'd recommend in that case is to use solid white-hat SEO to get high-quality links (e.g. editorially given by other sites on the basis of merit). I think going forward, you'll continue to see the supplemental results get even fresher, and website owners may see more traffic from their supplemental results pages.

To check out the current freshness of the supplemental results, I grabbed 20 supplemental pages from my site and checked out their crawl date using the "cache:" command and looking in the cached page header. The oldest supplemental results page that I saw was from September 7th, 2006 (and I only saw 2-3 pages from September; most were from December or November). The most recent of the 20 pages was from January 7, 2007, which shows that supplemental results can be quite fresh at this point."
As far as I know, putting the pages you want de-indexed in robots.txt is the right way. It will take forever anyway. I have pages that haven't been on my site since 2006, and they are still in the index. I tried to use the Google removal tool, but magically, after six months, they came back, and Google's tools tell me there are 404 errors. (Hell yeah! I'm telling you I removed those: what language do you understand? Klingon?) Kind of a lost battle.
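For what it's worth, you can check what Googlebot actually gets back for one of those ghost URLs with a quick HEAD request (example URL, obviously):

curl -I http://www.example.com/removed-page.html

If the response is a plain 404, Google may keep retrying it for a while; some people serve a 410 Gone for permanently removed pages instead, which is supposed to be a stronger "this is really dead" signal, though I can't promise it makes Google drop them any faster.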