Is there such a thing as a "new site penalty" that might cause the Google behavior described below? Specifically, there is a new (about a month and a half old) site containing information about widgets. It is a niche, database-driven site with millions of unique pages (a large, established industrial manufacturer just now getting online) describing widgets that are in high demand by other manufacturers. Most if not all of the pages are unique (pages are, at most, 55% alike, and that 55% similarity is entirely due to an omnipresent and rather extensive navigation bar; each page has unique meta tags, as there is no shortage of content) and have a very clear text-linking structure. The site also has a few links from established (PR5) sites, and new links keep coming at a slow pace.

The site has Google Sitemaps, does not use subdomains, has static, SE-friendly URLs for every page; every page and its CSS validate without a single warning; it serves proper 301 redirects and 304 (Not Modified) responses; and the domain was never registered before and has never changed owners since it was registered. The site currently has PR0. Basically, the site is designed with all *known* SE and usability guidelines in mind. The owner will not risk any questionable SEO techniques, as search engines are considered a valuable source of traffic given the size of the company's production line and web site.

When the site went live it was visited and indexed by Google within 3 days. All the pages on the first level (linked from the homepage) are in the index and have been showing up in SERPs from day one. The interesting thing is that Googlebot is visiting the site regularly (50,000 unique pages a day on average) and has visited about a million unique pages so far (200s, not 304s, in the logs), but has never included any of those pages in the index (aside from those that are linked directly off the homepage and that were in the index by the third day). None of the pages that are in the index are listed as supplemental results.

The question is whether there is some aging filter or "new site" filter that prevents Google from adding the pages it crawls to the index. I just can't think of anything else that might be wrong with the site, as it is an honest, properly designed informational site with easy navigation and clean code. Thoughts?
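For reference, here is a minimal sketch of how sitemaps for a site this size might be generated (using a hypothetical domain, www.example.com, and made-up URLs; the real site obviously differs). The Sitemaps protocol caps each sitemap file at 50,000 URLs, so the millions of pages have to be split across many child sitemaps referenced from a single sitemap index:

# Minimal sketch: split a large URL list into 50,000-URL sitemap files
# plus one sitemap index. The domain and URL list are hypothetical.
from itertools import islice
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
URLS_PER_FILE = 50000  # protocol limit per sitemap file

def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def write_sitemaps(urls, base="http://www.example.com"):
    """Write sitemap-1.xml, sitemap-2.xml, ... plus sitemap-index.xml."""
    names = []
    for i, chunk in enumerate(chunked(urls, URLS_PER_FILE), start=1):
        name = f"sitemap-{i}.xml"
        names.append(name)
        with open(name, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
            for url in chunk:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")
    with open("sitemap-index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
        for name in names:
            f.write(f"  <sitemap><loc>{base}/{name}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")

if __name__ == "__main__":
    # Hypothetical example: a handful of widget detail pages.
    pages = (f"http://www.example.com/widgets/{i}.html" for i in range(1, 4))
    write_sitemaps(pages)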
There is no aging filter on getting added to the index. (Ranking for competitive keywords is another story.) A site with millions of pages is unusual (especially a new one), and I think it is just going to take some time.
That's a fantastically detailed and well-written post. Top marks! The effect you are thinking of is the "Sandbox", where new sites are put on probation for a period of time while Google checks them out for quality. There is a forum dedicated to this: Search Engine Optimization > Sandbox. If I'm wrong and it isn't the Sandbox affecting you, then perhaps you just need to wait a little longer. Also note that PageRank is only updated periodically, so don't expect to see it change right away; I think the next update is in October. Keep working on those backlinks while you wait. Check which pages are actually in the index with the search term "site:www.domain.com". Cryo.
Not really, Mjewel. The site is based on an old, established, and very extensive database of widgets that is just now getting on the web, so the number of pages reflects the number of items.
Thank you, Cryo. I am not sure the "Sandbox" is the issue. The pages linked off the homepage (one click down) have been in the SERPs from day one and within a week occupied the top 3 positions for rather competitive keywords (over 1,000,000 results returned on average). It is the million lower-level pages that have been crawled but never indexed that I am curious about.
The sandbox has nothing to do with indexing of a site; it has to do with not counting backlinks when calculating SERPs for certain competitive keywords that Google has identified. "Competitive" is not about how many results a Google search returns, but is tied to the number of actual searches. You can have keywords for which a Google search returns millions of results, but which very few people search for. Using Overture (or the DP keyword suggestion tool) gives you an idea of how competitive a keyword is. MFA sites often target keywords that actually get searched for, not keywords that have a lot of search results. A keyword with a lot of search results isn't going to produce any income if no one is searching for it.

Without knowing the specific URL, it's impossible to know if there are potential problems with a bot being able to find all the pages. The "widgets" may be established, but a new domain that all of a sudden has millions of pages isn't something that is common, and it just takes time (assuming there are no other problems). I don't have any sites with millions of pages, but I do have one with a six-figure page count, and it took about 5-6 months to get all those pages added to the index. I still think your issues are likely related to the fact that you are talking about a site that is only 45 days old.