I find the duplicate content filter is fairly light, in that it only seems to affect pages with more than, say, 90% matching content. This can be seen by searching for a news headline: there will often be many pages with substantially duplicate content.
Hmm, what happens if you have pretty "loose" mod_rewrites? That is, a visitor can reach a page on my site with www.mysite.com/<anything>/12/index.htm. So will mysite.com/Apple/12/index.htm and mysite.com/Pear/12/index.htm be flagged as duplicate content on my site?
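For illustration, the kind of "loose" rule I mean looks something like this (the page.php script name is just made up for the example, not my real setup). The first path segment is matched but never used, so every /<anything>/12/index.htm URL serves byte-for-byte identical content at a different address:

```apache
RewriteEngine On
# Hypothetical "loose" rewrite: the first path segment is accepted but ignored,
# so /Apple/12/index.htm and /Pear/12/index.htm both map to the same script.
RewriteRule ^[^/]+/([0-9]+)/index\.htm$ /page.php?id=$1 [L]
```

Whether Google folds those URLs together or counts them as separate duplicate pages is exactly what I'm asking.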
How does Google ever know which webmaster duplicated the content? It's not always a fact that the site whose copy shows up first in the Google database is the original creator.
Google will normally filter out the page with the lower PageRank; it is this very thing that is exploited in one black hat trick.
Sure it is. Whatever page Google finds first will be treated as the "original" page. Well, that's what they were working on last I heard; don't know if that's in effect yet, though. Bompa
Bompa, sadly that is not how it works. Google will always show what it considers to be the most relevant content for any search. If it considers two pages to be identical, then it uses PageRank as a tie-breaker: the page with the higher PageRank is the one that stays. It is a common trick to scrape a competitor's page that ranks above you, then cloak it on your own site via IP cloaking. Google then sees two pages of identical content for that search, PageRank kicks in, and your competitor is thrown out of the SERPs. Of course this is black hat SE positioning, it will get you banned, and your page must also have a higher PageRank than your competitor's for it to work.
Why would it get me banned? Maybe it really is my content. Now we are back to "How does Google know who is the real owner of the content?" You are saying that PageRank determines ownership, and I am not arguing with that. What I said is that they are working on a "first found" type of system. Bompa
Apologies Bompa, I really should have read the entire post first, lol. As for why it should get you banned: it doesn't. The way that scum trick works is to get you filtered. Google filters duplicate pages out of its SERPs all the time; otherwise they would be filled with the same content.
Duplicate content... I do not believe it, after testing these two sites: vietnamparadisetravel.com and travelsvietnam.com. They have 94% the same content but they still rank very well on Google: try searching for "vietnam travel" or "vietnam tour". And I know they have not just 2 sites but 3 or 4 (one more: vietnamstopover.com) which have the same content and rank well too. I did a whois and found out that they host their sites on different servers (different IPs). So is the problem with my 2 sites that they are hosted on the same host, on the same server, so Google knows that and only returns one of my sites in the search results, not both? And I do not believe that Google can know who owns 2 websites and list only one of the 2. Any ideas?
Hmmm, why is it that when I am searching for a Perl function or info about mod_rewrite, or whatever, I get so many pages in the results that are all the "man" page? They are all dupes! Why are those not filtered out? Bompa
Applying filters is a resource-hungry exercise. I and many others believe that Google operates a multi-tier filtering system: the more commercial the phrase, the harder the filter. This ensures they are not wasting resources on non-competitive and non-commercial phrases. You will always find exceptions to the rule, but rest assured there IS a duplication filter. Also keep in mind that when we talk about duplicate pages, we are not talking about what the browser interprets the page as, we are talking about the underlying code. Google doesn't process the code, it simply takes it as it is. I could make three pages that all appear identical but have totally different code: one in tables, one in CSS absolute positioning, and one with dynamic content. All would look identical, but all three would be different.
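To make that concrete, here is a toy sketch (my own illustration, nothing official from Google) that compares two such pages with word-shingle similarity, once on the raw markup and once on the stripped-out visible text. The markup differs a lot; the visible text is identical:

```python
import re

def visible_text(html: str) -> str:
    """Crudely strip tags and collapse whitespace to approximate the rendered text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip().lower()

def shingles(text: str, size: int = 4) -> set:
    """Overlapping word n-grams, a common building block of duplicate detection."""
    words = text.split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets: 0 = nothing shared, 1 = identical."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa and sb else 0.0

# Same sentence, one page built with tables, one with CSS-styled divs.
page_tables = "<table><tr><td><b>Cheap widgets</b> shipped worldwide with free returns on every order placed today.</td></tr></table>"
page_css = "<div class='hero'><p>Cheap widgets shipped worldwide with free returns on every order placed today.</p></div>"

print(similarity(page_tables, page_css))                               # raw source: well below 1.0
print(similarity(visible_text(page_tables), visible_text(page_css)))   # visible text: exactly 1.0
```

Whether Google compares the raw source, the extracted text, or something else entirely is the bit nobody outside Google knows; this just shows how different the two answers can be.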
THanhNGuyen: yes, that is an easy catch for Google when they are hosted on the same server. Host them on different servers. Of course, if Google thinks they are the same or similar, it may check your DNS registration and see that you own both domains as well. And of course this only matters if they care that the keywords you are targeting are competitive enough to worry about. With the BigDaddy update, it will be interesting to see if they have increased their processing power enough to care all the time, regardless of the competition factor.
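To show how easy the "same server" catch is from the outside, here is a tiny sketch (placeholder domain names, swap in your own two sites) that checks whether two domains currently resolve to a shared IP address:

```python
import socket

def shares_ip(domain_a: str, domain_b: str) -> bool:
    """True if the two domains currently resolve to at least one common IP address."""
    ips_a = {info[4][0] for info in socket.getaddrinfo(domain_a, 80)}
    ips_b = {info[4][0] for info in socket.getaddrinfo(domain_b, 80)}
    return bool(ips_a & ips_b)

# Placeholder domains, just for illustration.
print(shares_ip("example.com", "example.org"))
```

That is only one signal, of course; shared whois details and DNS registration are the other obvious ones already mentioned in this thread.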
Old Welsh Guy, I'm not sure I understand what you are trying to say in your last post (third paragraph). Are you saying that Google doesn't consider the three pages to be the same, even though the content is the same, because they are coded differently? Because if you are, I would have to disagree, to the extent that I think Google looks at the final content as it would be rendered to determine duplicate content, not just the raw source of the page as it is sent from the web server.
Hi Tony, and what makes you think that the search engines are actually processing the code and ranking the final rendered pages, bearing in mind that they are not calling the elements within that page when spidering? I am not saying you're wrong, I am genuinely asking the question.
I think this might be what you guys are looking for: http://www.seroundtable.com/archives/003398.html There's quite a lot of interesting info there, so I can't just copy it all; better to go there and read it if you haven't yet. Very useful.
Please have a look: I have 2 websites about travel and tourism, one a .com and the other a .co.uk. Both sites went online 2 years ago. The .com now has PR3 and the .co.uk has PR4, and both are listed by Google. I must say that they have the same owner and are hosted on the same server (same IP address), but they don't have the same content; only a very few pages have around 80% duplicate content (it's really rare).

My problem is that the two sites keep swapping places in the SERPs. For example, when I searched for the keyword "vietnam holidays" (on Google), one site appeared on page 2 with rank 4 for a week. When I did the same search the next week, the other site appeared in the same position and the site from the previous week was gone. In other words, only one site ever appears in Google, not both of them. Maybe Google filters the content and only lists one site in the results.

But I don't believe that, after researching my competitor's 3 sites. They have 3 websites with approximately 99% the same content, yet all 3 sites are listed very well, have high PR (PR5), and hold good positions on the first results page for competitive keywords such as: vietnam travel, vietnam tour, vietnam vacation, vietnam holidays...

If you have experience, please post your ideas about this issue. My competitor's 3 sites are: httpx://www.vietnamparadisetravel.com httpx://www.travelsvietnam.com httpx://www.vietnamstopover.com My .co.uk site is: 2www.vietnam-holidays.co.uk My .com site is: 2www.haivenu-vietnam.com I would highly appreciate it if you would review my 2 sites. Thanks
Wow talk about beating a dead horse! No one knows for certain how the dupe filter is applied. I know a few webmasters that have sites that are 95% articles drawn from article sites such as ezinearticles and they do very well in the se's. The key they told me is in the uniqueness of your template and how you optimize the entire page. Others have stated that the dupe penalty is applied to mirror sites where you take and copy exactly the entire site, including template and place on another server. Here's an article published recently in SiteProNews. I included the whole article but you can skip to the bold section if you desire. Google's SEO Advice For Your Website: Content By Joel Walsh (c) 2005 The web pages actually at the top of Google have only one thing clearly in common: good writing. Don't get so caught up in the usual SEO sacred cows and bugbears, such as PageRank, frames, and JavaScrïpt, that you forget your site's content. I was recently struck by the fact that the top-ranking web pages on Google are consistently much better written than the vast majority of what one reads on the web. Of course, that shouldn't be a surprise, considering how often officials at Google proclaim the importance of good content. Yet traditional SEO wisdom has little to say about good writing. Does Google, the world's wealthiest media company, really ignore traditional standards of quality in the publishing world? Does Google, like so many website owners, really get so caught up in the process of the algorithm that it misses the whole point? Apparently not. Most Common On-the-Page Website Content Success Features Whatever the technical mechanism, Google is doing a pretty good job of identifying websites with good content and rewarding them with high rankings. I looked at Google's top five pages for the five most searched-on keywords, as identified by WordTracker on June 27, 2005. Typically, the top five pages receive an overwhelming majority of the traffïc delivered by Google. The web pages that contained written content (a small but significant portion were image galleries) all shared the following features: • Updating: frequent updating of content, at least once every few weeks, and more often, once a week or more. • Spelling and grammar: few or no errors. No page had more than three misspelled words or four grammatical errors. Note: spelling and grammar errors were identified by using Microsoft Word's chëck feature, and then ruling out words marked as misspellings that are either proper names or new words that are simply not in the dictionary. Does Google use SpellCheck? I can already hear the scoffing on the other side of this computer screen. Before you dismiss the idea completely, keep in mind that no one really does know what the 100 factors in Google's algorithm are. But whether the mechanism is SpellCheck or a better shot at link popularity thanks to great credibility, or something else entirely, the results remain the same. • Paragraphs: primarily brief (1-4 sentences). Few or no long blocks of text. • Lists: both bulleted and numbered, form a large part of the text. • Sentence length: mostly brief (10 words or fewer). Medium-length and long sentences are sprinkled throughout the text rather than clumped together. • Contextual relevance: text contains numerous terms related to the keyword, as well as stem variations of the keyword. 
SEO Bugbears and Sacred Cows

A hard look at the results shows that, practically speaking, a number of SEO bugbears and sacred cows may matter less to ranking than good content.

• PageRank: the median PageRank was 4. One page had a PageRank of 0. Of course, this might simply be yet another demonstration that the little PageRank number you get in your browser window is not what Google's algo is using. But if you're one of those people who attaches an overriding value to that little number, this is food for thought.
• Frames: the top two web pages listed for the most searched-on keyword employ frames. Frames may still be a bad web design idea from a usability standpoint, and they may ruin your search engine rankings if your site's linking system depends on them. But there are worse ways you could shoot yourself in the foot.
• JavaScript-formatted internal links: most of the websites use JavaScript for their internal page links. Again, that's not the best web design practice, but there are worse things you could do.
• Links: most of the web pages contained ten or more links; many contained over 30, in defiance of the SEO bugbears about "link popularity bleeding." Moreover, nearly all the pages contained a significant number of non-relevant links. On many pages, non-relevant links outnumbered relevant ones. Of course, it's not clear what benefit the website owners hope to get from placing irrelevant links on pages. It has been a proven way of lowering conversion rates and losing visitors. But Google doesn't seem to care if your website makes money.
• Originality: a significant number of pages contained content copied from other websites. In all cases, the content was professionally written content apparently distributed on a free-reprint basis. Note: the reprint content did not consist of content feeds. However, no website consisted solely of free-reprint content. There was always at least a significant portion of original content, usually the majority of the page.

Recommendations

• Make sure a professional writer, or at least someone who can tell good writing from bad, is creating your site's content, particularly in the case of a search-engine optimization campaign. If you are an SEO, make sure you get a pro to do the content. A shocking number of SEOs write incredibly badly. I've even had clients whose websites got fewer conversions or page views after their SEOs got through with them, even when they got a sharp uptick in unique visitors. Most visitors simply hit the "back" button when confronted with the unpalatable text, so the increased traffic is just wasted bandwidth.
• If you write your own content, make sure that it passes through the hands of a skilled copyeditor or writer before going online.
• Update your content often. It's important both to add new pages and update existing pages. If you can't afford original content, use free-reprint content.
• Distribute your content to other websites on a free-reprint basis. This will help your website get links in exchange for the right to publish the content. It will also help spread your message and enhance your visibility. Fears of a "duplicate content penalty" for free-reprint content (as opposed to duplication of content within a single website) are unjustified.
In short, if you have a mature website that is already indexed and getting traffic, you should consider making sure the bulk of your investment in your website is devoted to its content, rather than graphic design, old-school search-engine optimization, or linking campaigns.

About The Author: Joel Walsh's archive of web business articles is at the website of his business, UpMarket Content, a website content provider.
Here's what another webmaster recently posted on his blog about using "duplicate content":

'Fresh content' means new content for the created site that was not there before. It does not necessarily mean 'just written'. You will be able to put it there from the article database OR from articles that you have personally written or had written (aka "the drip effect").

Sounds like I need to clear up this "duplicate content penalty" that the rumors talk about so much... it is not what you think. It amazes me how rumors can exploit things. I personally submit press releases and articles all the time, and if you search for the title of any of them you will get pages and pages of results back from Google, Yahoo and MSN. All these pages contain my articles or press releases, and if duplicate content got your site banned, why then are there hundreds of sites all showing up? Wouldn't only one of them show up?

The sanctions imposed by Google hit people who have duplicate content under one domain (which sites do not get banned for; the search engine just picks one of the pages instead of both). The other issue is if you have three domains all with the exact same templates and content on them. This was a popular strategy people used to try; the mindset was "if this site makes me money, what about three of the exact same site?" That will get you banned.

Rest assured, many of my friends and I have been very successfully using other people's content. Make sites that are topic-focused authority sites, make every template different, add different content alongside the articles, get good incoming "deep" links, make the site user friendly, follow the search engines' guidelines and make the site USEFUL to the end user.
Concerning Google's duplicate content filter, I have a question, maybe a silly one, but still a question: if I have a site with the following sentence: "This is a sentence from the author X", and on another site I have the following sentence: "T h i s i s a sentence f r o m the a u t h o r X". Yes, with the spaces, that's the only difference. Will Google realize that this is duplicated content?
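Obviously nobody outside Google can say what normalisation they actually apply, but the spaces look like a flimsy disguise: strip the whitespace before comparing and the two sentences are identical again, as this toy sketch shows.

```python
def normalize(s: str) -> str:
    """Lowercase and remove all whitespace before comparing."""
    return "".join(s.lower().split())

a = "This is a sentence from the author X"
b = "T h i s i s a sentence f r o m the a u t h o r X"

print(normalize(a) == normalize(b))  # True: whitespace was the only difference
```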