Google Duplicate Content Filter Algorithm

Discussion in 'Google' started by venetsian, Jan 11, 2007.

  1. #1
    Does anyone have a clue what the approximate duplicate content filter algorithm is that Google uses?

    I wanted to test this out, so I put up a new website full of "copied" articles, about 2000 in total, and I'm running some tests to see if anything is going to be put in the index and what the approximate algorithm for the duplicate content filter is. I'm 100% sure that they are not checking the whole text, but some portions... or let's say about 50%...

    Anyone have any ideas how original, already-indexed content can be duplicated and at the same time be included in the Google index?

    For now I strongly suggest that you don't try this experiment on your "good" websites... I'm just running some tests to see what's going to happen.

    Any ideas? I'm waiting for results...
     
    venetsian, Jan 11, 2007 IP
  2. GuyFromChicago

    #2
    GuyFromChicago, Jan 11, 2007 IP
  3. mad4

    #3
    Duplicate content almost always gets indexed. It just doesn't always appear for certain queries. How often it gets filtered from the SERPs for a given query depends on the probability score that the page is a duplicate, weighed against the trustrank and relevance of the domain.

    In short, it's totally untestable and not something worth testing anyway.
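    The "filtered at query time, not refused at indexing" idea can be pictured with a toy sketch. This is purely illustrative, not Google's actual algorithm; the word-shingle similarity measure and the 0.8 threshold are my own assumptions:

```python
# Toy query-time duplicate filter: everything stays in the index, but a
# result that is too similar to a higher-ranked result gets dropped from
# the SERP for that query.

def shingles(text, k=4):
    """Set of overlapping k-word shingles from a page's text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b):
    """Jaccard similarity between two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def filter_serp(ranked_pages, threshold=0.8):
    """ranked_pages: list of (url, text), best first; returns shown urls."""
    shown = []
    for url, text in ranked_pages:
        if all(similarity(text, kept) < threshold for _, kept in shown):
            shown.append((url, text))
    return [url for url, _ in shown]
```

    On this model both copies are "indexed" (both are inputs), but only the higher-ranked one is shown, which matches duplicates existing in the index yet never appearing for a query.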
     
    mad4, Jan 11, 2007 IP
  4. jakomo

    #4
    Hi!
    I am testing it: at 78% I got flagged as duplicate, at 20% it is OK for now. I'm going to try 40% and 60% next :)

    Best regards,
    Jakomo
     
    jakomo, Jan 11, 2007 IP
  5. Bhartzer

    #5
    Absolutely not. It's totally testable. I believe the same duplicate content filter/algorithm is being used in blogsearch. So, if you have content and you want to see if it will pass the duplicate content filter, post it on your blog or ping Google with it. If it shows up in blogsearch after a few minutes, then it's not duplicate.

    I generally use the "25 percent rule": pages, as a whole, need to be at least 25 percent different from any other page in order to pass the duplicate content filter/algorithm.

    That's been true in my experience. And I believe it's testable through Google blogsearch.
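    One rough way to put a number on "25 percent different" is shingle overlap. The post doesn't say how the percentage is computed, so the measure below is my assumption:

```python
# Rough "percent different" score between two pages: the share of one
# page's word shingles that do not appear in the other page. The k-word
# shingle measure is an assumption; the "25 percent rule" above doesn't
# define how the percentage is computed.

def percent_different(page, other, k=3):
    words_a = page.lower().split()
    words_b = other.lower().split()
    shingles_a = {tuple(words_a[i:i + k]) for i in range(len(words_a) - k + 1)}
    shingles_b = {tuple(words_b[i:i + k]) for i in range(len(words_b) - k + 1)}
    if not shingles_a:
        return 100.0
    return 100.0 * len(shingles_a - shingles_b) / len(shingles_a)

# Under the rule of thumb above, a page "passes" when
# percent_different(page, other) >= 25 against every other page.
```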
     
    Bhartzer, Jan 11, 2007 IP
  6. mad4

    #6
    Saying the duplicate content filter is about indexing is ridiculous. Why would they need to filter pages from the SERPs if they weren't indexed in the first place?
     
    mad4, Jan 11, 2007 IP
  7. comforteagle

    #7
    You'll see your site disappear from the index with dupe content. Your AdSense account will likely get canceled as well.
     
    comforteagle, Jan 11, 2007 IP
  8. venetsian

    #8
    Yes, the whole site was crawled, but not a single page was included in the Google index.

    I'm going to try to see if more links to this page will make it pop up in the index.

    Has anyone tried separating the content into blocks? I heard somewhere that this might work... I'll try it out.

    Venetsian.

    By the way...

    Do you know if linking to such a "full of duplicate content" site might hurt the pages that link to it?
     
    venetsian, Jan 11, 2007 IP
  9. windtalker

    #9
    Posters who say Google won't index or will de-index a site for duplicate content are giving out wrong information. Google will index the site; the page just won't show up in a search result, or will have a tough time ranking if a similar page with more authority is in the search results.
    Adding more links will work.


    No, linking to it will not hurt if it is not a bad-neighborhood site: porn, spam, etc.
     
    windtalker, Jan 11, 2007 IP
  10. venetsian

    #10
    I like your response... makes a lot of sense.

    I'll continue with the linking process, and only time will tell if this website shows up.

    Other ideas?
     
    venetsian, Jan 11, 2007 IP
  11. Neale

    #11
    Unique Content is King. Period.

    Anything else will get penalized in one way or another.
     
    Neale, Jan 11, 2007 IP
  12. 1EightT

    #12

    I've NEVER heard of anyone having an AdSense account canceled for duplicate content. Please don't start rumors like that. New people will likely take it as fact when it DEFINITELY is not.
     
    1EightT, Jan 11, 2007 IP
  13. 1EightT

    #13
    That's why you run the content through algorithms that make it unique ;)
     
    1EightT, Jan 11, 2007 IP
  14. Neale

    #14
    That's why you run the content through algorithms that make it unique

    bien sur ;)
     
    Neale, Jan 11, 2007 IP
  15. venetsian

    #15
    Can you show me some samples of that algorithm, or at least its input and output?

    I'm curious... I've heard about it but I've never actually seen it myself.

    Please send a link.

    Cheers,

    Venetsian.
     
    venetsian, Jan 11, 2007 IP
  16. Neale

    #16
    Try Copyscape; that will give you an absolute answer ;)
     
    Neale, Jan 11, 2007 IP
  17. venetsian

    #17
    OK... so as far as I understand, there is a program that can make "duplicate content" not duplicate? I don't care about duplicate content search... that's easy to find.
     
    venetsian, Jan 11, 2007 IP
  18. thegypsy

    #18
    I don't know if a % approach would work these days, with G being big on phrase-based I/R processes for both dupes and spam...
     
    thegypsy, Jan 11, 2007 IP
  19. 1EightT

    #19
    Percentage change is definitely not the way to go about looking at it. Google, for example, is working on the ability to look at a page, gather its general theme, and then search for related phrases it expects to be found with that content. The closer you get to the statistical norm for that target phrase, the higher you are ranked for it (other off-page factors apply, of course).

    For examples of algorithms that can make content unique, check my signature or do a Google search for Markov chains. It's simple, but quite effective. People will complain about the readability, but computers aren't smart enough to read and comprehend text; they just look at chains of words and phrases.
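    For the record, the Markov-chain trick mentioned above can be sketched in a few lines. This is a minimal word-level chain; real "content spinners" of the era were more elaborate, and as noted, the output is often barely readable:

```python
# Minimal word-level Markov chain text generator: learn which word tends
# to follow each pair of words in the source, then walk the chain to
# emit "new" text that statistically resembles the original.
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word state to the words seen following it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=30, order=2, seed=None):
    """Random walk over the chain, starting from a random state."""
    rng = random.Random(seed)
    out = list(rng.choice(list(chain)))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # dead end: this state was never continued
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

    Every emitted word pair occurs somewhere in the source, which is exactly why readability suffers while word-level duplicate comparisons see "new" text.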
     
    1EightT, Jan 19, 2007 IP
  20. thegypsy

    #20
    Hey 1/8 - I did a sh_t load of research into duplicate content... check out this stuff on phrase-based indexing and retrieval (yummy resource links at the bottom).

    There is ever-growing evidence on the phrase-based stuff. At the bottom, check out the link to the patent on the 'similarity engine' - very interesting... Oh, and there is 'Detecting Spam in a Phrase Based I/R System'... not bad reading either.

    L8TR
     
    thegypsy, Jan 19, 2007 IP