I have 2 sites for one of my companies that are different from each other, with no duplicate content issues. I submitted sitemaps 6 months ago. One site (7 years old) shows 14,500 pages indexed, but there are only 2,200 pages on the site. The other site (4 years old) has 9 pages indexed out of 1,800. Of those 9 pages, 8 are supplemental and only the home page is indexed properly. I reported the over-indexing to Google about 2 months ago, and within 48 hours they reduced the number of indexed pages to 500.

I promote the site with more pages a lot more than the other, and it gets good rankings. The other site is just a paranoid 'just in case' site that only gets promoted when things are slow. The site with 14,000 indexed pages is considered an authority site and has been a page 1 site for 6.5 years (except for a few weeks here and there during a Google fart). I know about the site: operator problems, which explains the under-indexed site, but the main site keeps getting over-indexed. Anyone have this, or any explanation?
Are you certain that you do not have a session ID problem with the site? I've seen 100-page sites generate over 1,000 URLs due to a session ID showing up in the URL. A session ID should never show up in a URL: it creates multiple URLs representing the same page, which in turn triggers a duplicate content penalty, and the pages get tossed into supplemental results. If that's not the problem, then it is likely a Big Daddy issue.
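(For anyone who wants a quick way to check for this, here is a rough sketch in Python. It assumes you have your site's URLs dumped one per line in a text file; "urls.txt" and the parameter names are just placeholders, not a complete list of session-ID patterns. It groups URLs by the page they resolve to once session-style parameters are stripped, so any page with many raw URLs stands out.)

```python
# Rough sketch: spot session-ID-style duplicate URLs in a list of crawled URLs.
# Assumes a plain text file of URLs, one per line ("urls.txt" is a placeholder);
# the parameter names below are common session-ID patterns, not a complete list.
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

SESSION_PARAMS = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid", "oscsid"}

groups = defaultdict(set)
with open("urls.txt") as fh:
    for line in fh:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        # Keep only query parameters that are NOT session-style IDs.
        kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                      if k.lower() not in SESSION_PARAMS)
        canonical = (parts.netloc, parts.path, tuple(kept))
        groups[canonical].add(url)

# Any canonical page with more than one raw URL is a likely duplicate-content source.
for canonical, urls in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    if len(urls) > 1:
        print(len(urls), canonical[1])
```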
No session IDs. The BDP (bad data push) has been linked to spam sites, and there is nothing spammy about my site. I have gone over everything and there is no similarity between my site and the site with 5 billion pages indexed. The site is PR5 with 42 BL (Google's numbers), and the linking is very careful and relevant. Definitely a BD issue, but I can't figure out why it keeps happening to this site.
Yes. Every page is unique, with proper use of code. I do not submit my site to just anywhere, either. The rankings for this site are great, so I suppose I should not care so much. But the site: query problem associated with BD is de-indexing, and the BDP example that I have seen does not seem to apply.
I referenced a "bad data push" since it pointed to the "bad data" as being the number of pages shown to be in the index. IMO it has nothing to do with spam. If you think about it for a moment, the term was in no way used to describe the spam, just the number of pages being shown to exist. It was a "generic" term used to describe the error (# of pages) reported by the site: operator. Everyone automatically took the phrase to mean the SPAM being indexed.

If I were to hazard a W.A.G.: the number being shown is a compilation of all of Google's indexes, including Base, Froogle and the supplemental index. If those are not filtered properly, a site could easily show 3 times the number of pages just based on content that Google has stored multiple times for the same URL. The supplemental index (SI) has stored duplicate content for the same URL for some time now. This is why pages appear to be moving from the regular index (RI) to the SI, when in actuality the RI result isn't going live and the duplicate data for that URL in the SI is being shown instead. This is going to affect older domains more, since there has been more time to accumulate data in both the RI and SI. There are also URL-only results (a page that has been found but not yet crawled for the first time) that Google stores as well. Dave
I get the RI and SI movement. Out of the 14,000 pages, Google will only show around 1,000 if you take the results all the way to the end. I do have 2 older domains with indexing problems the other way: one has half its pages indexed, and the other has 1 supplemental page in the index. But those are Google f****d up problems and are common to many site owners. To have 1 site that has gotten over-indexed a couple of times now would seem to indicate a problem on my end. Otherwise it would have pages removed from the index like everyone else.
Not necessarily. Google is having problems with sites that it has not been having problems with, and should not be having problems with. The apparent over-indexing is definitely a Google problem. I have several sites, one of which was consistently in the SI but recently had half of its pages return to the RI without any changes on my part. The others have been fine. Dave
If this is a Google problem, and I will assume that it is, will the extreme quantity of pages trigger a duplicate filter? If I have approx 2,000 pages and 14,000 in the site: query, there would have to be some duplicate pages somewhere. Of those 14,000 pages, almost all are in the SI, so would the duplicate filter affect pages in the RI only, or am I going to get filtered out?
The duplicate filter is going to "remove" pages from the index completely, not just move them. I wouldn't worry too much about that. I seriously doubt they use total pages as a threshold. By comparison, 2000 pages would not likely be considered an "extreme" number of pages. Dave
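(If you want to rule out real duplicates on your own end anyway, here is a crude sketch, assuming a local copy of the site's HTML files in a "pages/" directory, which is just a placeholder. It only catches pages whose visible text is identical after stripping markup, not near-duplicates, so treat it as a first pass.)

```python
# Crude duplicate-content check over a local copy of the site's HTML files.
# "pages/" is a placeholder directory; exact-hash matching only catches
# identical text, not near-duplicates, so treat this as a first pass.
import hashlib
import pathlib
import re
from collections import defaultdict

def visible_text(html: str) -> str:
    """Strip tags and collapse whitespace; very rough, no parser needed."""
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

fingerprints = defaultdict(list)
for path in pathlib.Path("pages").rglob("*.htm*"):
    body = visible_text(path.read_text(errors="ignore"))
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    fingerprints[digest].append(path)

for digest, paths in fingerprints.items():
    if len(paths) > 1:
        print("Possible duplicates:", ", ".join(str(p) for p in paths))
```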
Dave, by extreme I meant having my site indexed 7 times over. Since the file structure has not changed in 6 years, I basically have 7 copies of my site indexed. Since the 4th-level pages are probably not indexed as often, who knows how many versions of my main page they are counting. Hate to sound paranoid here, but it is possible I have a hundred copies of my main page in the SI. The site ranks well, so maybe I should just be amused. Think I should just leave it?
Anytime you flirt with "identical" content, you run the risk. You already know that. Given the current *supposition* that the deeper your navigation levels go, the less likely BD is to index the page, I'd tend to smile, and be amused. Rankings are what count. Dave