Does anyone have a clue what the approximate duplicate content filter algorithm that Google uses looks like? I wanted to test it out, so I put up a new website full of "copied" articles, about 2000 in total, and I'm running some tests to see whether anything gets into the index and what the approximate algorithm for the duplicate content filter is. I'm 100% sure that they are not checking the whole text, but only some portions... or let's say about 50%... Anyone have any ideas how original, already-indexed content can be duplicated and at the same time be included in the Google index? For now I strongly suggest that you don't try this experiment on your "good" websites... I'm just running some tests to see what's going to happen. Any ideas?? I'm waiting for results...
"Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar." That's about as detailed as G will get. http://googlewebmastercentral.blogspot.com/2006/12/deftly-dealing-with-duplicate-content.html
Duplicate content almost always gets indexed. It just doesn't always appear for certain queries. How often it gets filtered from the serps for certain queries is related to the probability score that the page is a duplicate, weighed against the trustrank and relevance of the domain. In short, it's totally untestable and not something worth testing anyway.
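To make that claim concrete, here is a toy sketch (my own illustration of the idea in the post above, not anything Google has published) of how a filter decision could weigh a duplicate-probability score against domain authority. The weighting and the threshold are invented for the example.

```python
# Toy illustration only: an invented rule that filters a page from the
# results when its duplicate score outweighs the authority of its domain.
def filter_from_serps(duplicate_score: float, domain_authority: float,
                      threshold: float = 0.5) -> bool:
    """duplicate_score and domain_authority are both assumed to be in [0, 1]."""
    # A higher-authority domain tolerates a higher duplicate score
    # before being filtered; the weighting is purely hypothetical.
    adjusted = duplicate_score * (1.0 - domain_authority)
    return adjusted > threshold

# Example: a near-duplicate page (0.9) on a weak domain (0.2) gets filtered,
# while the same page on a strong domain (0.8) does not.
print(filter_from_serps(0.9, 0.2))  # True
print(filter_from_serps(0.9, 0.8))  # False
```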
Hi! I am testing it. With 78% duplication it got flagged as duplicate; with 20% it is OK for now. I am looking at 40% and 60% next. Best regards, Jakomo
Absolutely not. It's totally testable. I believe that the same duplicate content filter/algorithm is being used in blogsearch. So, if you have content and you want to see if it will pass the duplicate content filter, post it on your blog or ping Google with it. If it shows up in blogsearch after a few minutes then it's not duplicate. I generally use the "25 percent rule": pages, as a whole, need to be at least 25 percent different from any other page in order to pass the duplicate content filter/algorithm. That's true in my experience. But I believe it's testable through Google blogsearch.
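If you want to put a number on "at least 25 percent different", one common way is shingling plus Jaccard similarity, i.e. comparing the word n-grams two pages share. This is a standard near-duplicate technique, not necessarily what Google uses, and the 0.75 cutoff below simply mirrors the 25 percent rule mentioned above.

```python
# Compare two pages by the word n-grams ("shingles") they share.
# Standard near-duplicate detection sketch, not Google's actual filter.
def shingles(text: str, n: int = 4) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(page_a: str, page_b: str) -> float:
    a, b = shingles(page_a), shingles(page_b)
    if not a or not b:
        return 0.0
    # Jaccard similarity: shared shingles over all distinct shingles.
    return len(a & b) / len(a | b)

def looks_duplicate(page_a: str, page_b: str, cutoff: float = 0.75) -> bool:
    # Pages less than 25% different (i.e. >= 75% similar) are treated as dupes.
    return similarity(page_a, page_b) >= cutoff
```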
Saying the duplicate content filter is about indexing is ridiculous. Why would they need to filter the pages from the serps if they weren't indexed?
Yes, the whole site was crawled but no page is included in the Google index. I'm going to try and see if more links to these pages will make them pop up in the index. Has anyone tried separating the content into blocks? I heard somewhere that this might work... I'll try it out. Venetsian. By the way, do you know if linking to such a "full of duplicate content" site might hurt the pages that link to it?
Posters who say Google won't index or will de-index a site because it is duplicate content are giving out wrong information. Google will index the site, but the site/page just will not show up in a search result, or will have a tough time ranking if a similar page with more authority is in the search results. Adding more links will work. No, linking to it will not hurt, as long as it is not a bad-neighborhood site: porn, spam, etc.
I like your response... it makes a lot of sense. I'll continue with the linking process and only time will tell if this website shows up. Any other ideas?
I've NEVER heard of anyone having an AdSense account cancelled for duplicate content. Please don't start rumors like that. New people will likely take it as fact when it DEFINITELY is not.
Can you show me some samples of that algorithm, or at least input + output? I'm curious... I've heard about it but I've never actually seen it myself. Please send a link. Cheers, Venetsian.
OK... as far as I understand, there is a program that can make "duplicate content" not duplicate? I don't care about duplicate content search... that's easy to find.
I don't know if a % approach would work these days, with G being big on phrase-based I/R processes for both dupes and spam...
A percentage change is definitely not the way to look at it. Google, for example, is working on the ability to look at a page, gather its general theme, then search for related phrases it expects to be found with that content. The closer you get to the statistical norm for that target phrase, the higher you are ranked for it (other off-page factors apply, of course). For examples of algorithms that can make content unique, check my signature or do a Google search for Markov chains. It's simple, but quite effective. People will complain about readability, but computers aren't smart enough to read and comprehend text; they just look at chains of words and phrases.
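For anyone who hasn't seen one, here is a bare-bones word-level Markov chain generator of the kind described above. It is a generic illustration of the technique, not any particular tool, and the sample text is made up.

```python
import random
from collections import defaultdict

# Bare-bones word-level Markov chain: learn which word tends to follow
# which, then walk the chain to produce "rewritten" text. The output is
# often barely readable, which is exactly the complaint mentioned above.
def build_chain(text: str) -> dict:
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain: dict, start: str, length: int = 50) -> str:
    word, output = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

source = "the quick brown fox jumps over the lazy dog and the quick dog sleeps"
chain = build_chain(source)
print(generate(chain, "the", length=12))
```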
Hey 1/8 - I did a sh_t load of research into duplicate content... check out this stuff on phrase-based indexing and retrieval (yummy resource links at the bottom). There is ever-growing evidence for the phrase-based stuff. At the bottom, check out the link to the patent on the 'similarity engine' - very interesting... Oh, and there is 'Detecting Spam in a Phrase Based I/R System'... not bad reading either. L8TR
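My loose reading of the phrase-based papers (not the patented algorithm itself) is that the count of related phrases on a page is compared against a statistical norm for its topic: sitting far above the norm looks like spam, and matching another document's phrase profile too closely looks like a dupe. A toy version of the spam-side check, with the related-phrase list, expected mean, and tolerance all made up for the example:

```python
# Toy version of the "statistical norm" idea from the phrase-based I/R
# papers: count how many phrases related to a topic appear on the page
# and flag pages that sit far above the expected range. The related-phrase
# list, expected mean, and tolerance are all hypothetical.
RELATED_PHRASES = ["duplicate content", "search engine", "google index"]
EXPECTED_MEAN = 4.0      # hypothetical average count for honest pages
TOLERANCE = 3.0          # hypothetical allowed deviation

def related_phrase_count(page_text: str) -> int:
    text = page_text.lower()
    return sum(text.count(phrase) for phrase in RELATED_PHRASES)

def looks_like_phrase_spam(page_text: str) -> bool:
    # Far more related phrases than the statistical norm -> likely stuffing.
    return related_phrase_count(page_text) > EXPECTED_MEAN + TOLERANCE
```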