Does anyone have a clue what the approximate duplicate content filter algorithm that Google uses looks like? I wanted to test it out, so I put up a new website full of "copied" articles, about 2000 in total, and I'm running some tests to see whether anything gets into the index and what the approximate algorithm for the duplicate content filter is. I'm 100% sure that they are not checking the whole text, but only some portions... or let's say about 50%... Anyone have any ideas how original, already-indexed content can be duplicated and at the same time be included in the Google index? For now I strongly suggest that you don't try this experiment on your "good" websites... I'm just running some tests to see what's going to happen. Any ideas?? I'm waiting for results...
"Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar." That's about as detailed as G will get. http://googlewebmastercentral.blogspot.com/2006/12/deftly-dealing-with-duplicate-content.html
Duplicate content almost always gets indexed. It just doesn't always appear for certain queries. How often it gets filtered from the serps for certain queries is related to the probability score that the page is a duplicate, weighed against the trustrank and relevance of the domain. In short, it's totally untestable and not something worth testing anyway.
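To make that claim concrete, here is a toy sketch (my own illustration of the idea in the post above, not anything Google has published) of how a filter decision could weigh a duplicate-probability score against domain authority. The weighting and the threshold are invented for the example.

```python
# Toy illustration only: an invented rule that filters a page from the
# results when its duplicate score outweighs the authority of its domain.
def filter_from_serps(duplicate_score: float, domain_authority: float,
                      threshold: float = 0.5) -> bool:
    """duplicate_score and domain_authority are both assumed to be in [0, 1]."""
    # A higher-authority domain tolerates a higher duplicate score
    # before being filtered; the weighting is purely hypothetical.
    adjusted = duplicate_score * (1.0 - domain_authority)
    return adjusted > threshold

# Example: a near-duplicate page (0.9) on a weak domain (0.2) gets filtered,
# while the same page on a strong domain (0.8) does not.
print(filter_from_serps(0.9, 0.2))  # True
print(filter_from_serps(0.9, 0.8))  # False
```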
Hi! I am testing it. With 78% duplication it got flagged as duplicate; with 20% it is OK for now. I am looking at 40% and 60% next. Best regards, Jakomo
Absolutely not. It's totally testable. I believe that the same duplicate content filter/algorithm is being used in blogsearch. So, if you have content and you want to see if it will pass the duplicate content filter, post it on your blog or ping Google with it. If it shows up in blogsearch after a few minutes then it's not duplicate. I generally use the "25 percent rule": pages, as a whole, need to be at least 25 percent different from any other page in order to pass the duplicate content filter/algorithm. That's true in my experience. But I believe it's testable through Google blogsearch.
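If you want to put a number on "at least 25 percent different", one common way is shingling plus Jaccard similarity, i.e. comparing the word n-grams two pages share. This is a standard near-duplicate technique, not necessarily what Google uses, and the 0.75 cutoff below simply mirrors the 25 percent rule mentioned above.

```python
# Compare two pages by the word n-grams ("shingles") they share.
# Standard near-duplicate detection sketch, not Google's actual filter.
def shingles(text: str, n: int = 4) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(page_a: str, page_b: str) -> float:
    a, b = shingles(page_a), shingles(page_b)
    if not a or not b:
        return 0.0
    # Jaccard similarity: shared shingles over all distinct shingles.
    return len(a & b) / len(a | b)

def looks_duplicate(page_a: str, page_b: str, cutoff: float = 0.75) -> bool:
    # Pages less than 25% different (i.e. >= 75% similar) are treated as dupes.
    return similarity(page_a, page_b) >= cutoff
```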
Saying the duplicate content filter is about indexing is ridiculous. Why would they need to filter the pages from the serps if they weren't indexed?
Yes, the whole site was crawled but no page is included in the Google index. I'm going to try and see if more links to these pages will make them pop up in the index. Has anyone tried separating the content into blocks? I heard somewhere that this might work... I'll try it out. Venetsian. By the way, do you know if linking to such a "full of duplicate content" site might hurt the pages that link to it?
Posters who say Google won't index or will de-index a site because it is duplicate content are giving out wrong information. Google will index the site, but the site/page just will not show up in a search result, or will have a tough time ranking if a similar page with more authority is in the search results. Adding more links will work. No, linking to it will not hurt, as long as it is not a bad-neighborhood site: porn, spam, etc.
I like your response... it makes a lot of sense. I'll continue with the linking process and only time will tell if this website shows up. Any other ideas?
I've NEVER heard of anyone having an AdSense account cancelled for duplicate content. Please don't start rumors like that. New people will likely take it as fact when it DEFINITELY is not.
Can you show me some samples of that algorithm, or at least input + output? I'm curious... I've heard about it but I've never actually seen it myself. Please send a link. Cheers, Venetsian.
OK... as far as I understand, there is a program that can make "duplicate content" not duplicate? I don't care about duplicate content search... that's easy to find.
I don't know if a % approach would work these days, with G being big on phrase-based I/R processes for both dupes and spam...
A percentage change is definitely not the way to look at it. Google, for example, is working on the ability to look at a page, gather its general theme, then search for related phrases it expects to be found with that content. The closer you get to the statistical norm for that target phrase, the higher you are ranked for it (other off-page factors apply, of course). For examples of algorithms that can make content unique, check my signature or do a Google search for Markov chains. It's simple, but quite effective. People will complain about readability, but computers aren't smart enough to read and comprehend text; they just look at chains of words and phrases.
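For anyone who hasn't seen one, here is a bare-bones word-level Markov chain generator of the kind described above. It is a generic illustration of the technique, not any particular tool, and the sample text is made up.

```python
import random
from collections import defaultdict

# Bare-bones word-level Markov chain: learn which word tends to follow
# which, then walk the chain to produce "rewritten" text. The output is
# often barely readable, which is exactly the complaint mentioned above.
def build_chain(text: str) -> dict:
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain: dict, start: str, length: int = 50) -> str:
    word, output = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

source = "the quick brown fox jumps over the lazy dog and the quick dog sleeps"
chain = build_chain(source)
print(generate(chain, "the", length=12))
```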
Hey 1/8 - I did a sh_t load of research into duplicate content... check out this stuff on phrase-based indexing and retrieval (yummy resource links at the bottom). There is ever-growing evidence for the phrase-based stuff. At the bottom, check out the link to the patent on the 'similarity engine' - very interesting... Oh, and there is 'Detecting Spam in a Phrase Based I/R System'... not bad reading either. L8TR
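My loose reading of the phrase-based papers (not the patented algorithm itself) is that the count of related phrases on a page is compared against a statistical norm for its topic: sitting far above the norm looks like spam, and matching another document's phrase profile too closely looks like a dupe. A toy version of the spam-side check, with the related-phrase list, expected mean, and tolerance all made up for the example:

```python
# Toy version of the "statistical norm" idea from the phrase-based I/R
# papers: count how many phrases related to a topic appear on the page
# and flag pages that sit far above the expected range. The related-phrase
# list, expected mean, and tolerance are all hypothetical.
RELATED_PHRASES = ["duplicate content", "search engine", "google index"]
EXPECTED_MEAN = 4.0      # hypothetical average count for honest pages
TOLERANCE = 3.0          # hypothetical allowed deviation

def related_phrase_count(page_text: str) -> int:
    text = page_text.lower()
    return sum(text.count(phrase) for phrase in RELATED_PHRASES)

def looks_like_phrase_spam(page_text: str) -> bool:
    # Far more related phrases than the statistical norm -> likely stuffing.
    return related_phrase_count(page_text) > EXPECTED_MEAN + TOLERANCE
```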