Over the past few weeks I have seen a lot of large sites lose the majority of their indexed pages. This caused dramatic drops in traffic and widespread panic among site owners. Many assumed it was a Panda penalty. I didn't believe this at first, because Google said Panda is about de-prioritizing low-quality content, not deindexing vast swathes of the web, but the evidence was overwhelming. I have now noticed many of these sites getting their deindexed pages back, quite rapidly.

Here's a theory, from an engineer's perspective. For some reason - probably Panda - Google had to change how they store the data about sites in their index. Maybe they added new parameters, or rewired the database completely. They've done this a number of times in the past (and by that I mean system design changes, not algorithm changes). To store site data in the new format, with new parameters or signals or whatever, they had to pull the old data out first, thus causing massive deindexing and temporary drops in traffic. Google are now starting to reindex the pages, this time incorporating the new signals they weren't capturing before.

This is nothing but wild speculation, and even if it's correct, it probably oversimplifies a highly complex issue. But it would explain why a lot of large sites lost pages lately, and why those pages are starting to come back.

Keep an eye on site:alexa.com, down to about 450,000 right now. This looks like a massive penalty on a powerful authority site, and it has caused a noticeable drop in traffic. But I'm guessing the number will keep coming down, and then surge back to what it's supposed to be (about 12 million). Of course, getting pages indexed is one thing; getting those pages to rank well is another.
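To make the theory concrete, here is a toy sketch of what a pull-out-then-reindex migration could look like. Everything in it is invented for illustration - the schema, the signal names, and the functions are hypothetical, not anything Google has described - but it shows why site: counts would collapse to near zero and then recover as pages re-enter under the new format:

```python
# Hypothetical sketch of an index migration: pages leave the old index first,
# then re-enter a new schema that carries extra per-page quality signals.
# All names and values here are invented for illustration.

OLD_INDEX = {
    "alexa.com/page1": {"terms": ["traffic", "rank"]},
    "alexa.com/page2": {"terms": ["stats"]},
}

def compute_quality_signals(url):
    """Placeholder for the new signals (e.g. a Panda-style quality score)."""
    return {"quality_score": 0.5, "duplicate_ratio": 0.1}

def migrate(old_index):
    new_index = {}
    pending = list(old_index)           # snapshot of pages to move
    for url in pending:
        record = old_index.pop(url)     # page leaves the old index here,
                                        # so site: counts drop sharply...
        record.update(compute_quality_signals(url))
        new_index[url] = record         # ...and recover as it re-enters.
    return new_index

new_index = migrate(OLD_INDEX)
print(len(OLD_INDEX), len(new_index))   # old index emptied, new one filled
```

The point of the sketch is just the ordering: removal happens before re-insertion, so an observer watching indexed-page counts sees a crash followed by a rebound, even though no page was permanently penalized.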
alexa's indexed pages are 115,000 when checked from Pakistan. i am also facing a massive drop in indexed pages: from 300k to 30k in just 1 month. a complete directory with 5k pages was removed.
Well, de-prioritizing may become synonymous with de-indexing, because if Google decides the data is crap, why store it? I think this is actually a long-overdue update for Google. For the last 4 to 5 years Google has been flooded with tidal wave after tidal wave of mass spam: articles, directory submissions, blogs, and everybody else who can get their hands on a computer has been pumping the Internet full of crap, trash, and useless content just so they can drop a link to some site selling Viagra or some other trash. With the current state of the US economy a lot of people are turning to work-from-home methods, and they find blogging and social vomiting of the mouth an easy thing to do. So when you add it up over the last five years, Google is full of crap content using up space on their servers. Even mighty Google is going to draw the line at some point, where it will not index or keep on record content it considers to be trash.
dude, i totally agree with the sentiment, and crappy content and social vomiting isn't half of it - these days i can barely tell the difference between auto-generated garbage and human-generated garbage... but it seems google do want to store all the trash, perhaps because it MIGHT, just might, contain something useful someone will search for specifically with a long-tail query. google have always been obsessive about crawling and indexing. i'm not really surprised spam heaps like informe.com - the king of all spam sites - are starting to come back, at least as far as indexing goes. it's already up to 11 million, less than the 70 million it used to have but still an improvement; can't remember exactly how low they fell, but i'm pretty sure it was less than a million. of course, my theory could be completely wrong. alternative theories:
a. panda is being rolled back
b. a google server farm went up in flames
c. deindexing was caused by an inadvertent bug
d. ???
i checked alexa again in google with the site: operator. April 3rd: 255,000. April 5th: 577,000. they are coming back, but i am not ;-)
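For anyone tracking these swings the same way, a trivial way to log the counts and see the trend. Note there is no official API for site: result counts, so the numbers have to be read off the results page by hand; the two data points below are the ones reported in this comment, with no year attached:

```python
# Manually observed `site:alexa.com` result counts, as reported in the
# thread above (labels only, since no year was given).
counts = [
    ("Apr 3", 255_000),
    ("Apr 5", 577_000),
]

first, last = counts[0][1], counts[-1][1]
trend = "recovering" if last > first else "still falling"
print(f"{trend}: {first:,} -> {last:,}")
```

One caveat worth keeping in mind: site: counts are rough estimates and can vary by datacenter (as the Pakistan comment above suggests), so a two-point trend is only a weak signal.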