In light of the recent Mayday update and the need to start sending more focused URLs to Google for indexing, we have been reviewing our site structure and have identified over 25k URLs we no longer want Google to access.

To get the complete picture, think of it as an e-commerce site with browsing by multiple categories, plus filters by brand, product size, product price and so on. In essence, you end up with URLs like /Nike_ACG_Trail_Shoes_XL_50-100__04/, /Nike_ACG_Trail_Shoes_XXL_50-100__04/, /Nike_ACG_Trail_Shoes_L_50-100__04/, /Nike_ACG_Trail_Shoes_XL_100-150__04/ etc. Multiplied across all brands, categories and possible sizes for footwear, clothing etc., plus EU/US sizing standards, this produces a LOT of possible URLs, all listing products that are already available further up the funnel in the list of all Nike products or all Nike trail shoes.

I have now applied a link rel="canonical" on every such iteration, pointing to the /Nike_ACG_Trail_Shoes/ category, but the fact remains that over the course of May alone Google has accessed 26k of these 'version' URLs that we'd like removed.

The question is: how would you go about removing them from the index NOW? Google Webmaster Tools offers a way of submitting URLs for removal one at a time, which is clearly NOT an option at this scale. Would a robots.txt disallow do the job? What's the maximum robots.txt file size Google can parse without choking? If we put them in there, then once they are de-indexed, do we need to keep them in the file afterwards (given that no link to such a URL will ever be without a nofollow, and the URLs themselves carry a canonical link to a single parent)? What's the best course of action here?
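For reference, the canonical markup applied on each filtered variation is along these lines (the domain is a placeholder, the path is just the example category from above):

    <link rel="canonical" href="http://www.example.com/Nike_ACG_Trail_Shoes/" />

And if we did go the robots.txt route, I assume we would lean on Googlebot's wildcard support and block the filter patterns rather than list all 26k URLs individually. A rough, purely illustrative sketch matching the size/price suffixes in the example URLs:

    User-agent: Googlebot
    # block category URLs carrying a size or price-range filter suffix (illustrative patterns only)
    Disallow: /*_XL_
    Disallow: /*_XXL_
    Disallow: /*_50-100__
    Disallow: /*_100-150__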
I would just create a new sitemap and submit it to Google. It will take some time, but they do clean up the results. It happened to me before, and it took Google about two weeks to update the results with the new, current URLs. Best of luck.
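A minimal sitemap listing only the canonical category URLs would look something like this (the domain, path and date are placeholders based on the example above, not the poster's actual values):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/Nike_ACG_Trail_Shoes/</loc>
        <lastmod>2010-06-01</lastmod>
      </url>
      <!-- one <url> entry per canonical category page; leave out the filtered variations -->
    </urlset>

You would then submit that file through the Sitemaps section of Webmaster Tools and let Google recrawl from it.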
Check out this article by Google: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=164734