I posted this question in another forum, but maybe it was the wrong forum. I'm still trying to learn the mechanics behind Google, but it seems there are three different levels for an indexed page:
1.) Only the URL, with no other info (nothing cached, no additional meta info, just the URL link to the page).
2.) 'Supplemental Results' (if you do a 'site:domain.com' query, some pages have their titles indexed and some additional meta info from the page, but no cache, and a term called 'supplemental results' next to them).
3.) Fully cached and indexed with everything in it.
Now, I'm pretty sure I have seen pages go from step 1 to step 3. But what's up with step 2? Is that a kind of purgatory for pages that Google will admit exist, but give no credence to? Or do all pages slowly go from one step to the other?
1. Typically this means Google knows about the URL (from a link to it), but has not spidered/indexed it yet.
2. This is Google's secondary index. Only Google knows for sure why some pages are there, but from what I've observed, in most cases they have a cache date that is months old (so maybe when a page hasn't been respidered in a certain amount of time, it moves there). So typically you will see pages go from 1 to 3 to 2 (if they are never reindexed).
How odd. I had a site (not on Digital Point) that had all its pages created on the same day. Googlebot did its thing, and it seemed to create these two classes of pages immediately: some were A-class cached pages, and others were B-class supplemental pages.
I've actually seen a huge rise in the number of supplemental results from newer domains as of late. I know for sure certain errors can get you into this separate index, but that certainly isn't the only cause. It sure seems to me like supplemental results spring up when there are issues of duplicate content or not enough content variation between pages (just a few things changing, rather than full paragraphs of text, etc.).
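Purely to illustrate what "not enough content variation" can look like (this is not how Google measures it, and the threshold and function names below are my own assumptions), here's a minimal sketch that compares two templated pages' visible text using word-set overlap:

```python
# Hypothetical sketch: flag page pairs whose visible text overlaps heavily.
# This is NOT Google's algorithm -- just a rough way to see how little
# actually varies between templated pages.

def word_set(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of words the two pages have in common (0.0 to 1.0)."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_one = "Blue widgets for sale in Denver. Call us today for a quote."
page_two = "Blue widgets for sale in Boulder. Call us today for a quote."

score = jaccard_similarity(page_one, page_two)
print(f"similarity: {score:.2f}")  # high score -> only a few words differ
if score > 0.8:  # threshold is an arbitrary assumption
    print("pages are nearly identical -- possible duplicate-content trouble")
```

With only the city name changing between pages, the score comes out very high, which is exactly the "just a few things changing" pattern described above.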
I know that many people will say that meta descriptions don't matter any more, but I've had under-construction sites get large numbers of supplemental results when the pages had the same meta descriptions (or none at all). I changed the descriptions so that they were unique, and the pages became cached as 'normal' pages.
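To make the duplicate/missing meta description check concrete, here is a small hypothetical audit; the page dictionary and filenames are made up, and this only sketches the kind of check I do by hand:

```python
from collections import defaultdict

# Hypothetical site: maps each URL to its meta description (None = missing).
pages = {
    "/widgets/blue.html":  "Widgets and more widgets",
    "/widgets/red.html":   "Widgets and more widgets",   # duplicate
    "/widgets/green.html": None,                         # missing entirely
    "/about.html":         "About our widget company",   # unique -- fine
}

# Group URLs by their description so duplicates and gaps stand out.
by_description = defaultdict(list)
for url, description in pages.items():
    by_description[description].append(url)

for description, urls in by_description.items():
    if description is None:
        print("missing description:", ", ".join(urls))
    elif len(urls) > 1:
        print(f"duplicate description ({description!r}):", ", ".join(urls))
```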
I think factors like duplication and relevance can put pages in the supplemental index. I had some pages on a few sites that ended up in the supplemental index; after I made some changes to reduce duplication with other, less relevant pages, they moved to the regular index. Obviously nothing scientific about it, but all these pages were less than six weeks old at the time.
OK, so it doesn't necessarily have to be a permanent state. Weird. I've been looking on other boards, and nobody really has a good answer for what this is. I read the Google reply, but that was far from useful. Thanks.
Supplemental results, to me, seem to be pages that 1. haven't been cached in a long time and 2. have no internal or external links pointing to them. For example, a page that you no longer link to from your site, such as a links.php that is no longer in use. A page that no longer exists but is still in the index seems to make it into supplementals as well. I've got some 50,000+ pages in supplemental due to a restructuring of one of my sites: the pages no longer exist but are still cached, and they moved to supplemental.
Google's major data structures related to your page are:
1. Google learns about the existence of your page. Your page gets into the list of URLs to crawl (the crawling queue). The anchor text of the link may get into the index (probably depending on the quality of the link).
2. Google reads your page and puts it in the repository. The repository is NOT the index. Google may keep more than one version of your page in the repository. The snippets in the SERPs come from the repository, NOT the index. If your snippet shows your latest page version, that does not mean it is indexed.
3. Google puts your page in the index. The index is made up of hit lists for every word in the lexicon. Basically, a hit list for a word contains a list of all documents containing it, plus hit info such as the position of the word, font size, etc. The index contains on-page hits for only one version of your page. On-page hits are put in the index after your page gets crawled and put in the repository.
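For anyone who finds the repository/index distinction abstract, here is a toy inverted index in the spirit of that description (a lexicon plus per-word hit lists with positions). It is heavily simplified and the names are mine, not Google's:

```python
from collections import defaultdict

# Toy inverted index: for each word in the lexicon, keep a "hit list" of
# (document_id, position) pairs. Real hit lists also record font size,
# capitalization, anchor-text hits, etc. -- omitted here.

repository = {
    1: "cheap blue widgets",
    2: "blue widgets on sale",
}

index = defaultdict(list)          # word -> [(doc_id, position), ...]
for doc_id, text in repository.items():
    for position, word in enumerate(text.split()):
        index[word].append((doc_id, position))

# The lexicon is just the set of words that have hit lists.
lexicon = set(index)

print(index["widgets"])   # [(1, 2), (2, 1)] -> which docs contain it, and where
print("blue" in lexicon)  # True
```

The point of the sketch: the repository holds full page text (what the cache and snippets are built from), while the index only holds per-word hits, which is why a page can show a fresh snippet without its latest version being fully indexed yet.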