I have a problem with my website and am looking for advice on the quickest way to resolve it. Allow me to explain the issue and what I've done so far.

The website mydomain.com currently has over 60,000 pages in Google's index. The problem is that 50,000 of those pages don't belong and I don't want them. The objective is to get them removed as quickly as possible. The reason I don't want them: 32,000 are unrelated/off-topic to the theme of the website and all have the same Title and Description, and the other 18,000 are related but also share the same Title and Description. As far as I can tell, these URLs were discovered through two search boxes, by Googlebot submitting random queries and spidering the results (stupid bot; let Matt know how stupid by sending him this post). The problem URLs are www.mydomain.com/folder/page.htm?* (32,000) and www.mydomain.com/page.php?* (18,000).

Here's what I've done so far to try to get them removed.

For the 32,000 URLs:
1. Added Disallow: /folder/page.htm?* in robots.txt to deny crawling.
2. Set up a 301 from www.mydomain.com/folder/page.htm?* to www.mydomain.com/folder/results/page.htm?* so that I can block the new directory in robots.txt for future crawls.
3. Added Disallow: /folder/results/ to robots.txt to deny future crawling.
4. Requested URL removal for the directory /folder/results/ in WMT (this should be useless, since these new URLs are not in the index).
5. Added meta robots noindex, nofollow, noarchive in the HTML. The noarchive was added in hopes that it will remove the cached copy faster.

The ideal solution I was looking for was to ask WMT to remove URLs using a wildcard, for example: remove /folder/page.htm?*. But such an option does not exist.

Now I'm thinking I should remove action 1, because with crawling blocked Google may never see the 301, and therefore never update to the new blocked URL or refresh the cached copy. Then there's action 2, which could also be removed, since I added noindex anyway, so the 301 is no longer needed and might just add more delay. Action 3 would then be useless. Hmmm... I just went ahead and removed actions 1 and 2, leaving only action 5 in effect.

For the 18,000 URLs:
1. Added Disallow: /page.php?* in robots.txt to deny crawling.
2. Set up a 301 from www.mydomain.com/page.php?* to www.mydomain.com/old-search/results.html so that I can block the new directory in robots.txt for future crawls.
3. Added Disallow: /old-search/ to robots.txt to deny future crawling.

I didn't bother with URL removal in WMT for the directory, since it doesn't even exist yet. I am not able to modify the HTML meta to add noindex, since it is part of a CMS that uses a global setting. The next best thing I could come up with is to remove action 1 and then create an HTML sitemap with all 18,000 old URLs so that Google can update its index (that would be crazy). Also, this might not even work, since action 3 may prevent it. Even if it did, the caches would probably remain for a while. Hmmm... I guess I should probably remove action 1, because Googlebot may not be able to see that those URLs are now redirected. OK, I just did that.

Have I done the best thing I could do for the 32,000? What about the 18,000? Now that I've explained the issue (I hope it was clear), does anyone have a solution that will allow the quick removal of these URLs? What else can I do for either set? Any guesses are welcome.

BTW: This is kind of urgent, because I have dropped to position 2 for my main term after holding position 1 for over a year. And I believe it may be because my site profile has been compromised, with 80% of the structure being garbage and 60% off-topic. And it's all because of Googlebot looking for more data that it shouldn't be looking for.
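To make the remaining pieces concrete, they look roughly like this (simplified, and assuming Apache; the real paths are obviously different):

    # robots.txt - action 3 for the 18,000 (block the redirect target directory)
    User-agent: *
    Disallow: /old-search/

    # .htaccess - action 2 for the 18,000 (301 the query-string URLs to the new page)
    RewriteEngine On
    RewriteCond %{QUERY_STRING} .
    RewriteRule ^page\.php$ /old-search/results.html? [R=301,L]

    <!-- action 5 for the 32,000 - meta robots added to the problem pages -->
    <meta name="robots" content="noindex,nofollow,noarchive">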
"I am not able to modify the html meta to add noindex since it is part of a cms that uses a global setting." If the CMS you are using is not encrypted on the source level, and as it seems to be written in PHP... Then find out where is this /page.htm?* file exist on your server download it and start a normal text editor... Locate the (include) tag for the global setting of the meta tags (which should be on the first of the page code) And remove it to force this particular page to use the (noindex, nofollow, noarchive tags) in plain HTML. Hope I was clear and it might be a good idea !? REP is most appreciated.
Yes, the CMS is open source (Joomla). The actual page is /index.php, and I think this page controls all other pages. The problem URLs in Google are /index.php?option=com_search*. Right now, these pages are all being redirected to /search/results.html and are blocked by robots.txt. Will adding the noindex to the pages make a difference? Would it speed up the removal from the Google cache? BTW: Your code tip was clear, but just so you know, I'm not a programmer, so I wouldn't be able to do this on my own unless it's spelled out with the syntax. Either way, I would just pass it on to the webmaster.
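In case it helps, here's my rough understanding of what I'd be asking the webmaster to add to the template's head, based on your tip (untested, and the parameter check is just my guess from the URLs I'm seeing):

    <?php
    // In the Joomla template's index.php, inside <head>:
    // emit noindex only when the request is a com_search results page.
    if (isset($_GET['option']) && $_GET['option'] === 'com_search') {
        echo '<meta name="robots" content="noindex,nofollow,noarchive" />' . "\n";
    }
    ?>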
"Will adding the noindex to the pages make a difference? Would it speed up the removal from the google cache?" Yes it will help you speed the thing up a little bit, And while you will pass the work to the webmaster... I have a suggestion that might help you even more: While you are redirecting the search results to a new URL instep to get those pages out of Google index... Why don't you consider using some URL Rewriting features and try to benefit out of all those pages: ex: domain.com/search/what-ever-you-was-searching-for.html And pass the same search query to the title of the page, and so on... Reference for the webmaster: you can use mod_rewrite. Because if i was you, i would never get rid out of the 50,000 pages indexed in Google, otherwise I will try to make each page get me a visitor per day = 1,500,000 Visitors per month JUST THINK ABOUT IT !!?
Normally, I too would want 50,000 pages to remain in the index. In this case, though, they are useless and supplemental and would not rank for anything. So I'll just have to settle for my current 500,000 visitors per month. Anyone else out there with ideas on how to get the Google index cleaned up quickly?