Google's Duplicate Content Filter

Discussion in 'Google' started by Will.Spencer, Aug 8, 2004.

  1. webyp

    webyp Banned

    Messages:
    103
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #21
    I find the duplicate content filter fairly light, in that it only seems to affect pages with more than, say, 90% matching content. You can see this by searching for a news headline: there will often be many pages with substantially duplicate content.
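
    (As a rough sketch of what "90% matching content" could mean in practice: one common way to score near-duplicates is word shingling plus Jaccard similarity. The shingle size and threshold below are illustrative assumptions, not Google's actual values.)

    # Toy near-duplicate scoring with word shingles and Jaccard similarity.
    # Shingle size k=5 and the 0.9 threshold are assumptions for illustration.
    def shingles(text, k=5):
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def similarity(page_a, page_b):
        a, b = shingles(page_a), shingles(page_b)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)  # 1.0 = identical shingle sets

    # Under this toy model, two pages would be treated as near-duplicates
    # when similarity(page_a, page_b) > 0.9.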
     
    webyp, Dec 10, 2005 IP
  2. Sham

    Sham Peon

    Messages:
    136
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #22
    Hmm, what happens if you have pretty "loose" mod_rewrites?

    That is, a visitor can visit a page on my site with

    www.mysite.com/<anything>/12/index.htm

    So
    mysite.com/Apple/12/index.htm
    mysite.com/Pear/12/index.htm

    will be flagged as duplicate content on my site?
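
    (A minimal sketch of why a loose rewrite like that produces duplicate URLs: the first path segment is captured but never used, so every value of <anything> serves the same page. The pattern and IDs below are made up for illustration.)

    # Toy model of a "loose" rewrite: /<anything>/12/index.htm always
    # resolves to page 12, so arbitrarily many URLs map to one page.
    import re

    rule = re.compile(r"^/([^/]+)/(\d+)/index\.htm$")

    def resolve(path):
        m = rule.match(path)
        return f"page_id={m.group(2)}" if m else None  # segment 1 is ignored

    print(resolve("/Apple/12/index.htm"))  # page_id=12
    print(resolve("/Pear/12/index.htm"))   # page_id=12 -- same content, two URLs

    To a crawler those are simply two URLs returning identical content, which is exactly what a duplicate filter looks for.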
     
    Sham, Dec 12, 2005 IP
  3. codeteacher

    codeteacher Peon

    Messages:
    117
    Likes Received:
    2
    Best Answers:
    0
    Trophy Points:
    0
    #23
    How does Google ever know which webmaster duplicated the content? The site that shows up first with the content in Google's database isn't necessarily the original creator.
     
    codeteacher, Dec 13, 2005 IP
  4. Old Welsh Guy

    Old Welsh Guy Notable Member

    Messages:
    2,699
    Likes Received:
    291
    Best Answers:
    0
    Trophy Points:
    205
    #24

    Google will normally filter out the page with the lower PageRank; it is this very thing that is exploited in one black hat trick.
     
    Old Welsh Guy, Dec 13, 2005 IP
  5. Bompa

    Bompa Active Member

    Messages:
    461
    Likes Received:
    20
    Best Answers:
    0
    Trophy Points:
    58
    #25

    Sure it is :)

    Whatever page is found first by Google will be treated as the "original" page. Well, that's what they were working on last I heard; I don't know if that's in effect yet, though.

    Bompa
     
    Bompa, Dec 14, 2005 IP
  6. Old Welsh Guy

    Old Welsh Guy Notable Member

    Messages:
    2,699
    Likes Received:
    291
    Best Answers:
    0
    Trophy Points:
    205
    #26
    Bompa, sadly that is not how it works. Google will always show what it considers to be the most relevant content for any search. If it considers two pages to be identical, then it will use PageRank as a tie-breaker. The page with the higher PageRank is the one that stays.

    It is a common trick to scrape a competitor's page that is above you, then cloak it on your own site via IP cloaking. Google then sees two pages of identical content for that search. PageRank kicks in, and your competitor is thrown out of the SERPs. Of course, this is black hat SE positioning, it will get you banned, and your page MUST have a higher PageRank than your competitor's too.
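
    (If the filter works as described, the tie-break itself is simple to picture. A toy sketch; the grouping of pages into duplicates is assumed to have already happened, and the URLs and PageRank values are invented.)

    # Toy model of the tie-break described above: within a group of pages
    # judged duplicates, only the highest-PageRank page stays in the SERPs.
    duplicates = [
        {"url": "competitor.com/page", "pagerank": 4},
        {"url": "scraper.com/cloaked-copy", "pagerank": 6},
    ]

    kept = max(duplicates, key=lambda p: p["pagerank"])

    print("shown:", kept["url"])  # scraper.com/cloaked-copy
    print("filtered:", [p["url"] for p in duplicates if p is not kept])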
     
    Old Welsh Guy, Dec 14, 2005 IP
  7. Bompa

    Bompa Active Member

    Messages:
    461
    Likes Received:
    20
    Best Answers:
    0
    Trophy Points:
    58
    #27

    Why would it get me banned? Maybe it really is my content.

    Now we are back to "How does Google know who is the real owner of the content?"

    You are saying that PageRank determines ownership, and I am not arguing with that.

    What I said is that they are working on a "first found" type of system.


    Bompa
     
    Bompa, Dec 14, 2005 IP
  8. Old Welsh Guy

    Old Welsh Guy Notable Member

    Messages:
    2,699
    Likes Received:
    291
    Best Answers:
    0
    Trophy Points:
    205
    #28
    Apologies Bompa, I really should have read the entire post first lol. As for why it should get you banned, it doesn't. The way that scum trick works is to get you filtered. Google filters duplicate pages from their SERPs all the time; otherwise they would be filled with the same content.
     
    Old Welsh Guy, Dec 14, 2005 IP
  9. THanhNGuyen

    THanhNGuyen Peon

    Messages:
    23
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #29
    Duplicate content... I do not believe it, after testing these two sites:
    vietnamparadisetravel.com and travelsvietnam.com. They have 94% the same content, but they still rank very well on Google: try searching for "vietnam travel" or "vietnam tour". And I know they don't have only 2 sites but 3 or 4 (one more: vietnamstopover.com), all with the same content, and all ranking well too. I did a whois and found out that they host their sites on different servers (different IPs). So is the problem with my 2 sites that they are hosted with the same host, on the same server, so Google knows that and only returns one of my sites in the search results, not both? And I do not believe that Google can know who owns 2 websites and list only one of the 2.

    Any ideas?
     
    THanhNGuyen, Mar 5, 2006 IP
  10. Bompa

    Bompa Active Member

    Messages:
    461
    Likes Received:
    20
    Best Answers:
    0
    Trophy Points:
    58
    #30

    Hmmm, why is it that when I am searching for a Perl function or info about mod_rewrite, or whatever, I get so many pages in the results that are all the "man" page? They are all dupes!

    Why are those not filtered out?


    Bompa
     
    Bompa, Mar 6, 2006 IP
  11. Old Welsh Guy

    Old Welsh Guy Notable Member

    Messages:
    2,699
    Likes Received:
    291
    Best Answers:
    0
    Trophy Points:
    205
    #31
    Applying filters is a resource-hungry exercise. I and many others believe that Google operates a multi-tier filtering system: the more commercial the phrase, the harder the filter. This ensures they are not wasting resources on non-competitive and non-commercial phrases.

    You will always find exceptions to the rule, but rest assured there IS a duplication filter.

    Also keep in mind that when talking about duplicate pages, we are not talking about what the browser interprets the page as; we are talking about the underlying code. Google doesn't process the code, it simply takes it. I could make 3 pages all appearing identical but with totally different code: one in tables, one in CSS absolute positioning, and one with dynamic content. All look identical, but all 3 are different.
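
    (One way to see the distinction: three different markups, one visible text. A minimal sketch using Python's standard-library HTML parser; the markup snippets are invented. Stripping tags shows what a filter would see if it compared extracted text rather than raw source.)

    # Three different codings of the same visible text. Tag-stripping
    # collapses them to one string; byte-for-byte the sources all differ.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())

    pages = [
        "<table><tr><td>Widgets for sale</td></tr></table>",
        '<div style="position:absolute">Widgets for sale</div>',
        "<p><span>Widgets for sale</span></p>",
    ]

    texts = set()
    for html in pages:
        p = TextExtractor()
        p.feed(html)
        texts.add(" ".join(p.chunks))

    print(len(set(pages)), "sources ->", len(texts), "distinct text")  # 3 -> 1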
     
    Old Welsh Guy, Mar 6, 2006 IP
  12. tonystai

    tonystai Peon

    Messages:
    14
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #32
    THanhNGuyen

    Yes, that is an easy catch for Google when they are hosted on the same domain. Host them on different domains. Of course, if they think the sites are the same or similar, they may check your DNS registration and see that you own both domains as well.

    Of course, this only matters if they think the keywords you are targeting are competitive enough to worry about.

    With the BigDaddy update, it will be interesting to see if they have increased their processing power enough to care all the time, regardless of the competition factor.
     
    tonystai, Mar 6, 2006 IP
  13. tonystai

    tonystai Peon

    Messages:
    14
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #33
    Old Welsh Guy,

    I'm not sure I understand what you are trying to say in your last post, 3rd paragraph. Are you saying that Google doesn't consider the 3 pages to be the same, even though the content is the same, because they are coded differently?

    Because if you are, I would have to disagree, to the extent that I think Google is looking at the final content as it would be rendered to determine duplicate content, not just the raw source of the page as it is sent from the web server.
     
    tonystai, Mar 6, 2006 IP
  14. Old Welsh Guy

    Old Welsh Guy Notable Member

    Messages:
    2,699
    Likes Received:
    291
    Best Answers:
    0
    Trophy Points:
    205
    #34
    Hi Tony, and what makes you think that the search engines are actually processing code and ranking final pages? Bearing in mind that they are not calling the elements within that page when spidering?

    I am not saying you're wrong, I am genuinely asking the question.
     
    Old Welsh Guy, Mar 6, 2006 IP
  15. DomainMagnate

    DomainMagnate Illustrious Member

    Messages:
    10,932
    Likes Received:
    1,022
    Best Answers:
    0
    Trophy Points:
    455
    #35
    I think this is what you guys might be looking for here:

    http://www.seroundtable.com/archives/003398.html

    There's quite a lot of interesting info there, so I can't just copy it all; better to go there and read it if you haven't yet. Very useful.
     
    DomainMagnate, Mar 6, 2006 IP
  16. THanhNGuyen

    THanhNGuyen Peon

    Messages:
    23
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #36
    Please have a look:
    I have 2 websites about travel and tourism; one is a .com and the other is a .co.uk. Both sites went online 2 years ago. Now the .com has PR3 and the .co.uk has PR4, and both are indexed by Google. I must say they have the same owner and are hosted on the same server (same IP address), but they don't have the same content; just a very few pages have around 80% duplicate content (it's really rare). My problem is that the two sites keep swapping SERP positions with each other. For example, when I searched for the keyword "vietnam holidays" (on Google), one site appeared on page 2 at rank 4 for a week. When I did the same search the next week, the other site appeared in the same position, and the site that had appeared the week before was gone.

    All this means that only one site appears in Google, not both of them. Maybe Google filters the content and only lists one site on the results page. But I don't believe that, having researched my competitor's 3 sites. They have 3 websites with approximately 99% the same content, but all 3 sites are indexed very well, have high PR (PR5), and hold good positions on the first results page for competitive keywords such as: vietnam travel, vietnam tour, vietnam vacation, vietnam holidays...

    If you have experience, please post your ideas about this issue.
    My competitor's 3 sites that I refer to are:
    www.vietnamparadisetravel.com
    www.travelsvietnam.com
    www.vietnamstopover.com

    My .co.uk site is: www.vietnam-holidays.co.uk
    My .com site is: www.haivenu-vietnam.com
    I would highly appreciate it if you would review my 2 sites.

    Thanks
     
    THanhNGuyen, Mar 9, 2006 IP
  17. webgator

    webgator Well-Known Member

    Messages:
    122
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    130
    #37
    Wow, talk about beating a dead horse!

    No one knows for certain how the dupe filter is applied. I know a few webmasters who have sites that are 95% articles drawn from article sites such as EzineArticles, and they do very well in the SEs. The key, they told me, is in the uniqueness of your template and how you optimize the entire page.

    Others have stated that the dupe penalty is applied to mirror sites, where you copy the entire site exactly, template included, and place it on another server.

    Here's an article published recently in SiteProNews. I included the whole article but you can skip to the bold section if you desire.

    Google's SEO Advice
    For Your Website: Content
    By Joel Walsh (c) 2005

    The web pages actually at the top of Google have only one thing clearly in common: good writing. Don't get so caught up in the usual SEO sacred cows and bugbears, such as PageRank, frames, and JavaScript, that you forget your site's content.

    I was recently struck by the fact that the top-ranking web pages on Google are consistently much better written than the vast majority of what one reads on the web.

    Of course, that shouldn't be a surprise, considering how often officials at Google proclaim the importance of good content. Yet traditional SEO wisdom has little to say about good writing.

    Does Google, the world's wealthiest media company, really ignore traditional standards of quality in the publishing world? Does Google, like so many website owners, really get so caught up in the process of the algorithm that it misses the whole point?

    Apparently not.

    Most Common On-the-Page Website Content Success Features

    Whatever the technical mechanism, Google is doing a pretty good job of identifying websites with good content and rewarding them with high rankings.

    I looked at Google's top five pages for the five most searched-on keywords, as identified by WordTracker on June 27, 2005. Typically, the top five pages receive an overwhelming majority of the traffic delivered by Google.

    The web pages that contained written content (a small but significant portion were image galleries) all shared the following features:

    • Updating: frequent updating of content, at least once every few weeks, and more often, once a week or more.

    • Spelling and grammar: few or no errors. No page had more than three misspelled words or four grammatical errors. Note: spelling and grammar errors were identified by using Microsoft Word's check feature, and then ruling out words marked as misspellings that are either proper names or new words that are simply not in the dictionary. Does Google use SpellCheck? I can already hear the scoffing on the other side of this computer screen. Before you dismiss the idea completely, keep in mind that no one really does know what the 100 factors in Google's algorithm are. But whether the mechanism is SpellCheck or a better shot at link popularity thanks to great credibility, or something else entirely, the results remain the same.

    • Paragraphs: primarily brief (1-4 sentences). Few or no long blocks of text.

    • Lists: both bulleted and numbered, form a large part of the text.

    • Sentence length: mostly brief (10 words or fewer). Medium-length and long sentences are sprinkled throughout the text rather than clumped together.

    • Contextual relevance: text contains numerous terms related to the keyword, as well as stem variations of the keyword.

    SEO Bugbears and Sacred Cows

    A hard look at the results shows that, practically speaking, a number of SEO bugbears and sacred cows may matter less to ranking than good content.

    • PageRank. The median PageRank was 4. One page had a PageRank of 0. Of course, this might simply be yet another demonstration that the little PageRank number you get in your browser window is not what Google's algo is using. But if you're one of those people who attaches an overriding value to that little number, this is food for thought.

    • Frames. The top two web pages listed for the most searched-on keyword employ frames. Frames may still be a bad web design idea from a usability standpoint, and they may ruin your search engine rankings if your site's linking system depends on them. But there are worse ways you could shoot yourself in the foot.

    • JavaScript-formatted internal links. Most of the websites use JavaScript for their internal page links. Again, that's not the best web design practice, but there are worse things you could do.

    • Links: Most of the web pages contained ten or more links; many contained over 30, in defiance of the SEO bugbears about "link popularity bleeding." Moreover, nearly all the pages contained a significant number of non-relevant links. On many pages, non-relevant links outnumbered relevant ones. Of course, it's not clear what benefit the website owners hope to get from placing irrelevant links on pages. It has been a proven way of lowering conversion rates and losing visitors. But Google doesn't seem to care if your website makes money.

    • Originality: a significant number of pages contained content copied from other websites. In all cases, the content was professionally written content apparently distributed on a free-reprint basis. Note: the reprint content did not consist of content feeds. However, no website consisted solely of free-reprint content. There was always at least a significant portion of original content, usually the majority of the page.

    Recommendations

    • Make sure a professional writer, or at least someone who can tell good writing from bad, is creating your site's content, particularly in the case of a search-engine optimization campaign. If you are an SEO, make sure you get a pro to do the content. A shocking number of SEOs write incredibly badly. I've even had clients whose websites got fewer conversions or page views after their SEOs got through with them, even when they got a sharp uptick in unique visitors. Most visitors simply hit the "back" button when confronted with the unpalatable text, so the increased traffic is just wasted bandwidth.

    • If you write your own content, make sure that it passes through the hands of a skilled copyeditor or writer before going online.

    • Update your content often. It's important both to add new pages and update existing pages. If you can't afford original content, use free-reprint content.

    • Distribute your content to other websites on a free-reprint basis. This will help your website get links in exchange for the right to publish the content. It will also help spread your message and enhance your visibility. Fears of a "duplicate content penalty" for free-reprint content (as opposed to duplication of content within a single website) are unjustified.

    In short, if you have a mature website that is already indexed and getting traffic, you should consider making sure the bulk of your investment in your website is devoted to its content, rather than graphic design, old-school search-engine optimization, or linking campaigns.


    About The Author
    Joel Walsh's archive of web business articles is at the website of his business, UpMarket Content, a website content provider.
     
    webgator, Mar 9, 2006 IP
  18. webgator

    webgator Well-Known Member

    Messages:
    122
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    130
    #38
    Here's what another webmaster recently posted on his blog about using "duplicate content":

    ‘Fresh Content’ means new content for the created site that was not there before. It does not necessarily mean ‘just written’. You will be able to put it there from the article database OR from articles that you have personally written or had written. (aka "the Drip Effect")

    Sounds like I need to clear up this “Duplicate Content Penalty” that rumors talk so much about… it is not what you think. It amazes me how rumors can distort things. I personally submit press releases and articles all the time, and if you search for the title of any of them you will get pages and pages of results back from Google, Yahoo and MSN. All these pages contain my articles or press releases, and if duplicate content got your site banned, why then are there hundreds of sites all showing up? Would not only one of them show up?

    These sanctions imposed by Google are hitting people that have duplicate content under one domain (which sites do not get banned for; the search engine just picks one of the pages instead of both). The other issue is if you have three domains, all with the exact same templates and content on them. This was a popular strategy people used to try; the mindset was, if this site makes me money, what about three of the exact same sites? That will get you banned.

    Rest assured, many of my friends and I have been very successful using other people's content. Make sites that are topic-focused authority sites, make every template different, add different content along with the articles, get good incoming “deep” links, make the site user-friendly, follow the search engines' guidelines and make the site USEFUL to the end user.
     
    webgator, Mar 9, 2006 IP
  19. jonhy.pear

    jonhy.pear Peon

    Messages:
    105
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #39
    Concerning Google's duplicate content filter I have a question, maybe a silly one, but still a question:

    if I have a site with the following sentence :

    "This is a sentence from the author X"

    and in another site I have the following sentence :

    "T h i s i s a sentence f r o m the a u t h o r X"

    Yes, with the spaces, that's the only difference.

    Will google realize that this is duplicated content?
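
    (Whether a filter catches that depends entirely on how the text is normalised before comparison. A toy check of the two strings: word-level comparison misses the match, while stripping all whitespace catches it.)

    # jonhy.pear's example: same characters, extra spaces in the copy.
    a = "This is a sentence from the author X"
    b = "T h i s i s a sentence f r o m the a u t h o r X"

    print(a.split() == b.split())                    # False: the "words" differ
    print(a.replace(" ", "") == b.replace(" ", ""))  # True: identical once spaces go

    Keep in mind the spaced-out version would likely also be indexed under nonsense one-letter "words", so even if it slipped past a duplicate filter, it would not rank for the original phrase.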
     
    jonhy.pear, Apr 5, 2006 IP
  20. BWDOW

    BWDOW Guest

    Messages:
    210
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #40
    Using duplicate content may get you banned from Google.
     
    BWDOW, Apr 5, 2006 IP