Has anyone been testing this filter? What does the filter use to determine duplicate pages? Are there simple ways to beat the filter?
Interesting terminology. Did this filter thing start in the SEO communities? Everything there is to know about duplicate content is in the two patents Google has on dup content: "Detecting duplicate and near-duplicate files" and "Detecting query-specific duplicate documents". If anyone needs help understanding them, I'll be happy to help.
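For a rough idea of what that kind of near-duplicate detection involves, here is a minimal sketch of shingle-based resemblance in Python. It is only my own illustration of the general technique, not the algorithm in those patents; the shingle size and threshold are made up.

import hashlib

def shingles(text, k=8):
    # Hash every overlapping run of k words into a fingerprint set.
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def resemblance(doc_a, doc_b, k=8):
    # Jaccard similarity of the two fingerprint sets, 0.0 to 1.0.
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two pages could be treated as near-duplicates when resemblance() exceeds
# some threshold (0.9 would be an arbitrary guess, not a known Google value).

Note that exact copies and lightly edited copies both score high under this kind of measure, which is presumably why mirrors trip whatever Google actually uses.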
Well, that's a wee bit more complex... You see... this Internet thingy existed before Google came 'round. Actually, this Internet thingy existed before the world wide web. And, back in those dark ages, before HTML existed, we allowed each other to copy what we wrote and store it on FTP servers. Then we updated to Gopher. Eventually, we learned HTML and carried that philosophy to the world wide web. The unpleasant(?) side effect is that, after a recent domain name change, one of my mirrors is now knocking me out of the SERPS for quite a few of my (our?) pages. It really shouldn't bother me. It really shouldn't. I really shouldn't care whether the users are looking at my content on my server or on one of the mirrors. I dunno. It's bugging me. But... not enough to change the way we have been working since before the web was invented.
Spencer, read the patents. When there's duplicate content in the SERPs, Google shows the one page that it thinks is best (the one with the highest PageRank). The problem with duplicate documents is that Google might decide to crawl them very infrequently, and that way your mirrors will outrank the main pages for longer than you would want.
nohaber, you are correct! Googlebot visits every night, but I just checked and found that the set of mirrored pages where I am not winning the (friendly) duplicate content war are not being visited by Googlebot. I used to believe that Google chose the duplicate with the higher PR. However, Google seems to have chosen randomly between me and my #1 mirror. I win some pages and he wins others. PR distribution should be a lot more even than that. Right now, I'm not sure what to believe on that point. Ah well, all of the (current) mirrors are also mirroring my ads.
Will, Google looks at pages, not sites, so what you're saying (that G is choosing some pages off your site and some off the mirror) is right. Keep in mind that the toolbar PR is not the actual PR; it is PR rounded to a whole number. Your pages are likely to be linked to individually, and this could give one page an edge over its mirror.
I'm wandering off-topic, but... I don't need a mirror. Mirrors were important in the late 80's and early 90's, but today their function is largely performed by Google cache and Archive.org's WayBackMachine. However, people like to mirror and I agreed to this arrangement years ago. I'm not going to back out now because of some silly search engine algorithm.
It amazes me how many webmasters believe that Google would reveal portions of their ranking methods by filing patent applications that would never be enforceable even *IF* the patents were granted. Oh well, we believe what we want to believe. Bompa
If the content is of a static nature, how about placing something dynamic (a few lines of randomly selected text, an RSS feed, etc., or even manually editing something) on the pages you prefer Google to look at? Possibly Google seeing those pages as more recently updated might change its mind? Just a thought.
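Something along these lines on the server side, perhaps; a minimal sketch with made-up snippets, just to show the idea of the primary page differing slightly from the mirror on each crawl:

import random

# Hypothetical list of short snippets; in practice this could be RSS items,
# recent comments, or anything the mirrors do not carry.
SNIPPETS = [
    "Tip: check your server logs for Googlebot visits.",
    "Recently updated: see the changelog for the latest revisions.",
    "Note: this page is also available via our mirrors.",
]

def render_page(static_body):
    # Append one randomly chosen snippet so the primary copy is never
    # byte-identical to the mirrored copy.
    return static_body + '\n<p class="fresh">' + random.choice(SNIPPETS) + "</p>"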
Ah yes... I have been doing that. I have three different sets of server-side dynamic content on the primary site which do not appear on the mirror sites. Unfortunately, this has no effect. Well, perhaps unfortunately. Really, only one of these sets of pages should be showing up in the index. I currently have Googlebot banned from the mirror site. This is unfortunate, because almost every keyword from the primary site dropped several pages in the SERPS with the arrival of Jager1. I allowed Googlebot back to the mirror site for a while, and it did reasonably well in the SERPS. I've disallowed Googlebot again due to administrative/security issues on the mirrored site. So now the mirror gets almost no traffic and the main site gets little more. Thankfully, my #2 (unrelated) site has more than doubled in revenue in the last two months.
I have the same problem - and have had to "train" clients to write unique text to ensure that the listing isn't picked up as duplicate content. This is a tiresome job, but has solved what was a major issue on one of my websites.
The point being this: The duplicate content filter is not per-page; it is per-paragraph or even per-sentence.
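If that is how it works, a per-sentence comparison would look something like the sketch below; again, just an illustration of the granularity argument, not Google's actual filter.

import re

def sentence_hashes(text):
    # Split on sentence punctuation, normalize whitespace and case, and hash.
    sentences = re.split(r"[.!?]+", text.lower())
    return {hash(" ".join(s.split())) for s in sentences if s.strip()}

def shared_fraction(page_a, page_b):
    # Fraction of page_a's sentences that appear verbatim in page_b.
    a, b = sentence_hashes(page_a), sentence_hashes(page_b)
    return len(a & b) / len(a) if a else 0.0

# A high shared_fraction for one paragraph could flag that span as duplicate
# even when the surrounding page is unique.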
I just did an experiment. I went to an article re-use site. In the SEO section I sorted by oldest and picked something in the middle with a unique title. I copied and pasted the title into G. It came up with 300+ sites. I scanned the results, and for the greater part they are links to sites carrying that article. Have I missed the point, or does my experiment refute your claim? (Honestly, I do not know.)
Will, an interesting discussion is going on at WMW about this: http://www.webmasterworld.com/forum30/32129-98-10.htm