Duplicate Content Filter

Discussion in 'Search Engine Optimization' started by elklabone, Jan 17, 2005.

  1. #1
    If we tripped it, how long for it to go away and re-index us?

    Also, what if we find some good information that relates to our company (it's a standard FAQ that goes on all reps for this company), but the pages carrying it are all like PR1...

    And we duplicate the FAQ on a PR4 page, would they become "our duplicate"? Or does it go by age of page?

    --Mark
     
    elklabone, Jan 17, 2005 IP
  2. Smyrl

    Smyrl Tomato Republic Staff

    #2
    When I suspected a paragraph I was using on two sites was causing problems, I immediately removed the offending paragraph and redid the second page. It did not take long to see results of the change. Of course I will never know if that was the problem, but the paragraph appeared on two of my sites and the second site had little real content on the page other than the duplicated paragraph.

    Shannon
     
    Smyrl, Jan 17, 2005 IP
  3. Owlcroft

    Owlcroft Peon

    #3
    I have yet to find any usefully reliable information on just what is likely to trip that purported filter. There are tools available on the web (such as the Similar Pages Checker) that purport to say, as a percentage, how "similar" two pages are, but no one can know if the methods Google uses are essentially the same. I have seen speculation--and that's all it was--that once it needed 80% similarity or so to trip the filter, but that now the figure is down considerably, to perhaps 60% or even less.
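
    For readers wondering what a "percent similar" figure can even mean, here is a minimal sketch of one common way to compute one: Jaccard overlap of three-word shingles. This is an assumption made for illustration only. Nothing in this thread says the checker tool or Google works this way, and the function names are invented here.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity_percent(a, b, k=3):
    """Jaccard similarity of the two shingle sets, as a percentage."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 100.0  # two empty texts are trivially identical
    return 100.0 * len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    page_a = "Our standard FAQ covers shipping, returns and warranty terms."
    page_b = "Our standard FAQ covers shipping, returns and payment options."
    print(round(similarity_percent(page_a, page_b), 1))
```

    Different choices (shingle length, whether markup is stripped first, which overlap measure is used) give different percentages for the same two pages, which is one reason a figure from any one tool says little about what a search engine sees.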

    In my definite opinion, unless G has some algo cleverer than their prior art would lead one to expect, a tight duplicate-content filter is disastrously unfair. There are large numbers of sites with large numbers of perfectly valid, useful pages that will necessarily have high similarity percentages: catalogues, indices, and the like. Such pages will, of their inherent nature, have a high "overhead" of boilerplate wrapped around the perhaps relatively small--but crucial--amount of unique information.

    I have come to suspect that the reason that many sites (for a notorious example, Amazon) wrap all their pages with so much "customized" or randomly served baloney is not just sheer marketing overkill, but a perceived need to avoid tripping that essentially ridiculous filter.

    I am, so far as I can tell (Google stubbornly refuses to reply with any tiniest hint to repeated requests for assistance), running into problems with that filter on my encyclopedia site, and it is perhaps a good example of the difficulties. To begin with, there are nearly a million articles in the thing; that--using G's recommendation of not over 100 links a page--mandates about 10,000 pages that are sheer index. When a page has just 100 lines of unique content surrounded by even a modest amount of boilerplate, it will absolutely have to have a very high similarity to the other 9,999 such pages.
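
    The page count above is simple arithmetic, sketched here with round numbers taken from the post (about a million articles, at most 100 links per page). The function name is hypothetical.

```python
import math


def index_pages_needed(total_articles, links_per_page=100):
    """Number of index pages required if each page may carry at most
    links_per_page links and every article is linked exactly once."""
    return math.ceil(total_articles / links_per_page)


if __name__ == "__main__":
    # Round figures from the post: ~1,000,000 articles, 100 links per page.
    print(index_pages_needed(1_000_000))
```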

    To help, I have added to each index page the entire content of a randomly selected encyclopedia article from within the range indexed on a given page. Even with that, when I just submitted a half dozen randomly selected index-page pairs to the checker, I saw a low of 35% similarity to a high of 67%. Thus, if the filter is set at, say, 50%, at least some, perhaps many, of those pages--which are, to a human eye, drastically different--would trip that filter. I am already burdening the pages with excess junk they neither want nor need; do I have to add yet more? Is G essentially forcing sites to clutter up their pages with Amazon-like crap? It sure looks that way.
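
    To make the threshold worry concrete, here is a small sketch that checks the two extremes reported above (35% and 67%) against the cutoffs that have been speculated in this thread (80%, 60%, 50%). The cutoffs are guesses, not known values, and the rule "similarity at or above the cutoff trips the filter" is itself an assumption.

```python
def trips_filter(similarity_percent, threshold_percent):
    """Assumed rule: a page pair is filtered when its similarity is at or
    above the threshold. The real rule, if any, is not public."""
    return similarity_percent >= threshold_percent


if __name__ == "__main__":
    # Only the lowest and highest measurements were reported in the post.
    reported = {"lowest pair": 35.0, "highest pair": 67.0}
    # Speculated cutoffs mentioned in the thread, not confirmed figures.
    for threshold in (80.0, 60.0, 50.0):
        for label, score in reported.items():
            verdict = "trips" if trips_filter(score, threshold) else "passes"
            print(f"threshold {threshold:.0f}%: {label} ({score:.0f}%) {verdict}")
```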

    The symptom is that Google is taking site pages out of its index about as fast as it adds them in. At one time, we had climbed up to about a quarter-million pages indexed, then we saw the total drop, day by day, down to about 28K. Since then, it has fluctuated up to about 32K then down to about 30K. We don't especially need significant SERPs for the pages--we just urgently need for them to be indexed. G absolutely, positively will not give any indication of why they are removing pages as fast as they add them. And our front page hasn't been in the index for a couple of months now (though the Firefox PR Bar still says it's PR5).

    In response to G's usual potted reply "look over our guidelines", I sent them back a complete set of all their several posted sets of guidelines, with each and every one annotated with comments that we have met or exceeded their requirements (or "recommendations"), but, rather obviously, no one there ever troubled to read it.

    That bothers me a lot. Google can make all the claims it wants about being a private company exercising its free-speech rights, but in fact it is very like a credit-reporting agency. Such an agency cannot include or exclude information, or construct a "credit score", by means that are not objective, rational, and knowable. But Google can and does construct ratings and listings in secret and arguably non-objective ways, despite the reality that its actions have at least as much effect on many individuals and businesses as does a credit score with Equifax.

    (One could argue that algorithms are "objective", but if Equifax applied an "algorithm" that lowered scores based on neighborhood, as banks used to and possibly still secretly do, no one would say that that was "objective" treatment.)
     
    Owlcroft, Jan 17, 2005 IP
  4. billion

    billion G.E.M.

    #4
    Does anyone know if it's the actual text or layout of the page that is the problem?

    Will it help if sentences are shown in a random order or should the layout of the page be changed?

    Maybe I should test this myself, I was just curious if someone already knows anything about it. :)
     
    billion, Jan 18, 2005 IP
  5. spdude

    spdude Guest

    #5
    I have about a dozen pages with very little text on them. The layout (header, footer, menu, header jpg image, etc.) is the same on all of them.

    The title tags, H1 titles and small amount of text are different though, by a good 40%. There is some duplication though. All the pages are indexed and regularly crawled. In light of this, I would say it's the actual body text which triggers the filter and not the layout alone.
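
    One way to test the "body text, not layout" idea on your own pages is to drop everything the pages share (header, footer, menu) and compare only what is left. A rough sketch, assuming the shared layout shows up as identical lines of text on every page; the function name is made up for this example and says nothing about how a search engine separates template from content.

```python
def strip_shared_lines(pages):
    """Remove lines that appear on every page (assumed to be layout
    boilerplate) and return the remaining body text of each page.

    Note: with a single page, every line counts as shared and is removed.
    """
    if not pages:
        return []
    line_sets = [
        {line.strip() for line in page.splitlines() if line.strip()}
        for page in pages
    ]
    shared = set.intersection(*line_sets)
    return [
        "\n".join(
            line.strip()
            for line in page.splitlines()
            if line.strip() and line.strip() not in shared
        )
        for page in pages
    ]


if __name__ == "__main__":
    pages = [
        "Site header\nHome | Links\nWidgets for sale\nCopyright footer",
        "Site header\nHome | Links\nGadget repair tips\nCopyright footer",
    ]
    for body in strip_shared_lines(pages):
        print(repr(body))
```

    If the stripped bodies come out empty or nearly identical, the pages differ only in layout-level details, which matches the empty resource pages described below that never got cached.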

    In contrast, earlier, we put up some resource pages just so they can gather PR, and we can use them for link trading. The pages were empty. No text at all besides the header and footer. Only one got indexed and the others never showed a cache, even after several weeks. After realising this, we made them a bit different in the titles and a sentence or two in the main body. This was enough to get them all crawled.
     
    spdude, Jan 18, 2005 IP
  6. Refrozen

    Refrozen Peon

    #6
    Refrozen, Jan 18, 2005 IP
  7. leeds1

    leeds1 Peon

    #7
    yes, I think it's the actual text

    I had most of my pages with testimonials on and a copyright footer

    All those pages are *gone*

    Others without those are still in the game

    I have re-written/ coded my site all this afternoon

    Bummer - no sales patter now :mad:
     
    leeds1, Jan 19, 2005 IP
  8. nichewriter

    nichewriter Peon

    #8
    Apparently you now need 50 to 60% uniqueness in your text body; 35% doesn't cut it any longer.
     
    nichewriter, Apr 10, 2011 IP
  9. music41410

    music41410 Peon

    #9
    nichewriter, how did you get this number?
     
    music41410, Apr 12, 2011 IP
  10. nichewriter

    nichewriter Peon

    #10
    Read it on this forum, sorry don't remember which thread.
     
    nichewriter, Apr 21, 2011 IP