What constitutes similar pages?

Discussion in 'Google' started by Lucky Bastard, Dec 7, 2004.

  1. #1
    Hard to explain, but when you go here:
    http://www.google.com/search?q=site:www.digitalpoint.com&hl=en&lr=&start=930&sa=N
    down at the bottom it says:
    In order to show you the most relevant results, we have omitted some entries very similar to the 939 already displayed.
    If you like, you can repeat the search with the omitted results included.

    Does anyone have any opinions on just what G considers to be "very similar" results? What does it take for a page NOT to be considered as such?
     
    Lucky Bastard, Dec 7, 2004 IP
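For what it's worth, the "repeat the search with the omitted results included" link at the bottom of that page simply re-runs the same query with Google's filter=0 parameter appended, so you can put the filtered and unfiltered result sets side by side. A minimal sketch in Python of building both URLs, using the query from the post above:

    # Build the normal (filtered) and "omitted results included" (filter=0) URLs
    # for the same site: query, so the two result counts can be compared.
    from urllib.parse import urlencode

    base = "http://www.google.com/search"
    params = {"q": "site:www.digitalpoint.com", "hl": "en"}

    filtered_url = base + "?" + urlencode(params)                      # similar entries omitted (default)
    unfiltered_url = base + "?" + urlencode(dict(params, filter="0"))  # what the "omitted results" link does

    print(filtered_url)
    print(unfiltered_url)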
  2. DomainLoot

    DomainLoot Guest

    #2
    I think part of this might mean results from the same site, just different pages, maybe?

    Never gave it a lot of thought before now...
     
    DomainLoot, Dec 7, 2004 IP
  3. Owlcroft

    Owlcroft Peon

    #3
    The question is very, very definitely not trivial, because "similar" pages can trip G's "duplicate-content" filter, which, by popular report, has recently gotten *much* more aggressive.

    There are various tools on the web that will return a supposed measure of "percentage similarity" between any two selected pages. Does anyone have any experience-based information on roughly what percentage similarity triggers G's "duplicate" alarm? (Since the cranking up thereof in, I would say, late November?)

    As I have remarked at great length on another thread here, perfectly innocent pages, whose real content is utterly different from one to another, can--it seems--unintentionally trip the alarm if the content is relatively brief compared to some page-common boilerplate; this will be especially true, as it seems to have been in my case, of index pages, where the real content is, say, 75 to 100 links. I have checked, and see figures from 35% to as high as 60% similarity between pages that any human would say are virtually 100% different.
     
    Owlcroft, Dec 8, 2004 IP
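For readers wondering how those "percentage similarity" tools might arrive at a figure, here is a rough sketch: split each page's text into overlapping word shingles and take the Jaccard overlap of the two sets. The 4-word shingle size and the toy pages are arbitrary choices for illustration; real tools, and Google's own filter, may work quite differently.

    import re

    def shingles(text, size=4):
        # overlapping runs of `size` consecutive words, as a set
        words = re.findall(r"[a-z0-9']+", text.lower())
        return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 0))}

    def similarity(text_a, text_b, size=4):
        a, b = shingles(text_a, size), shingles(text_b, size)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)  # Jaccard index, 0.0 to 1.0

    # Two index pages whose real content is completely different but whose
    # shared boilerplate dominates the text, as in Owlcroft's example.
    boilerplate = "Acme Widgets complete site index, all categories listed below. " * 5
    page_a = boilerplate + "red widgets blue widgets green widgets"
    page_b = boilerplate + "garden tools lawn mowers hedge trimmers"
    print(f"{similarity(page_a, page_b):.0%}")  # roughly 40%, from boilerplate alone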
  4. SEbasic

    SEbasic Peon

    #4
    I believe that the old duplicate filter was working at around 80% similarity.

    That threshold seems to have come down an awful lot recently, although I can't provide any actual figures for where the dup content filter is now running.
     
    SEbasic, Dec 8, 2004 IP
  5. PR Weaver

    PR Weaver Peon

    #5
    Can you please give me some examples of queries showing the problem of similar pages being penalised, but without the site: command?

    Thanks,
    Olivier Duffez
    PR Weaver
     
    PR Weaver, Dec 9, 2004 IP
  6. Foxy

    Foxy Chief Natural Foodie

    #6
    Actually guys, this is not about duplicate content [even though the comments above are good info] - this is about how Google displays results and saves processor time.

    What it does is make an "arbitrary" decision on how many pages of one site you might like to look at!

    To check this, do the site: search.
    Note the count and the address of the last result listed before the "similar" message, e.g. "351" and "Bog Rolls" http....

    Now click on the "see the lot" link (i.e. repeat the search with the omitted results included) and scroll down to that 351st address.
    Now re-run the site: search and you will find, say, 341 pages.

    Now run it again and it will be, e.g., 371 pages.

    If you then click on results page 36 to see the 351st entry, it will truncate the listing.
    Click on page 33 and it will truncate again.

    "Similar pages" here just means more pages from the same site. :)
     
    Foxy, Dec 10, 2004 IP
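A hedged sketch of the check Foxy describes: fetch the same site: query a few times, once with the similar-results filter on and once with filter=0, and record the totals Google reports. The regex assumes the count still appears in the old "of about <b>N</b>" form in the HTML, which is only a guess and could break at any time (and automated querying is against Google's terms, so treat this purely as an illustration of the manual procedure).

    import re
    import time
    import urllib.parse
    import urllib.request

    def reported_total(query, unfiltered=False):
        # Fetch the results page and pull out the reported total, if present.
        url = ("http://www.google.com/search?q=" + urllib.parse.quote(query)
               + ("&filter=0" if unfiltered else ""))
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
        match = re.search(r"of about <b>([\d,]+)</b>", html)
        return int(match.group(1).replace(",", "")) if match else None

    for attempt in range(3):
        print(reported_total("site:www.digitalpoint.com"),
              reported_total("site:www.digitalpoint.com", unfiltered=True))
        time.sleep(5)  # be polite between requests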
  7. SEbasic

    SEbasic Peon

    #7
    SEbasic, Dec 10, 2004 IP
  8. suni

    suni Peon

    #8
    Hello, I am new to this forum, and since I was reading this thread I want to say that I agree with Foxy: similar pages means more pages from URLs already displayed.
     
    suni, Dec 10, 2004 IP
  9. SEbasic

    SEbasic Peon

    #9
    First, welcome to the forum :)

    Can you guys just clarify what you mean?
    Maybe I've just misunderstood your posts, but the link I just pasted kinda shows that isn't true...
     
    SEbasic, Dec 10, 2004 IP
  10. darksat

    darksat Guest

    #10
    I was aware that related sites were based on similar links, but I don't think I've ever heard that specific phrase before. Got any good white papers on it? It sounds like a Google term.
     
    darksat, Dec 10, 2004 IP
  11. SEbasic

    SEbasic Peon

    #11
    SEbasic, Dec 10, 2004 IP
  12. Foxy

    Foxy Chief Natural Foodie

    #12
    Welcome to the forum also.

    What was originally asked was: when you do a site: search, at the end of the listings you get something like this:

    If you click on the 'see all' link you will see the remaining pages of the 998.

    The question was whether these pages called "similar pages" are duplicate content. The answer of course is no. :)
     
    Foxy, Dec 10, 2004 IP
  13. SEbasic

    SEbasic Peon

    #13
    Gotcha - Sorry for the confusion ;)
     
    SEbasic, Dec 10, 2004 IP
  14. Foxy

    Foxy Chief Natural Foodie

    #14
    Hehehehehe
     
    Foxy, Dec 10, 2004 IP
  15. Owlcroft

    Owlcroft Peon

    #15
    Does anyone have any numerical idea of what the G "duplicate-content" filter has been cranked up to?

    Someone posted that at some past time it was perceived as being at about 80% similarity. I suspect--though I cannot be sure or close to it--that by now it is operating below 50%, perhaps in the 40% range.

    I have modified a large (10,000+) set of site-index pages, which were, by some measuring tool on the web, coming in at 40% to 60% similarity (because even with minimal surrounding boilerplate, 100 one-to-three-word links are not a large part of any page's text), so that some extra nominally relevant (and download-time-wasting, thank you Google) material is tacked on; my new figures look like 20% to 30% similarity, so we'll see if G will start indexing them again.
     
    Owlcroft, Dec 10, 2004 IP
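A back-of-the-envelope way to see why padding the pages lowers the measured similarity: if two pages share a fixed block of boilerplate and the rest of each page is unique, a simple overlap ratio falls as the unique text grows. The word counts below are invented for illustration; Owlcroft's real pages and whichever measuring tool he used will differ.

    def overlap_ratio(shared_words, unique_words_per_page):
        # shared boilerplate as a fraction of the combined text of the two pages
        total = shared_words + 2 * unique_words_per_page
        return shared_words / total

    print(f"{overlap_ratio(400, 300):.0%}")  # thin index pages: about 40%
    print(f"{overlap_ratio(400, 700):.0%}")  # after padding each page: about 22%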
  16. crew

    crew Peon

    #16
    For the past couple of weeks, I've had sites with many hundreds of pages of unique content stuck at about 200 pages of 'non-similar' content. I started these pages at the same time, and 'marketed' them PR-wise in similar ways. It's baffling to me that 3 separate sites with unique content all became stuck within 10 pages of 200 total 'non-similar' pages. Last week, Googlebot hit about 1,000 pages of one site and the count doubled (conveniently) to approx. 400 'non-sim' pages.

    I'm starting to think that Google might have a limit based on PR, time since initial index, and maybe some other factors (for example, I don't think it is too difficult to structurally or semantically identify a blog or a directory) to determine the total number of 'non-similar' pages. I don't think it has anything to do with 'similarity' as we commonly define it. 100, 200, 400... maybe it's just a coincidence, but it seems like a good way to make sure that content is legit before getting permanently indexed.

    My plan to improve this is to increase my PR. I know PR isn't really important for search results, but I could see G still using it to determine how deep or thorough a crawl of a site is.

    Anyway, just some thoughts. Nothing concrete to back it up, but it feels like a pattern to me.
     
    crew, Dec 11, 2004 IP
  17. Owlcroft

    Owlcroft Peon

    #17
    "I'm starting to think that Google might have a limit based on PR, time since initial index, and maybe some other factors (for example, I don't think it is too difficult to structurally or semantically identify a blog or a directory) to determine the total number of 'non-similar' pages. I don't think it has anything to do with 'similarity' as we commonly define it."

    --------------

    And I'm starting to think that Google has just flat-out gone off the rails. Somebody, somewhere within Google had a wet dream, and was highly enough placed to get it implemented. The results are insane and catastrophic, but when has that ever bothered Google?

    "We don't care--we don't have to."

    Sigh.
     
    Owlcroft, Dec 12, 2004 IP
  18. Mel

    Mel Peon

    #18
    When Google says this at the end of a search ("In order to show you the most relevant results, we have omitted some entries very similar to the ones already displayed"):

    I do not think they are actually comparing whole pages for similarity; they have elected to shorten the results in order to provide more relevant ones, and I suspect (though cannot prove) that this is probably the work of the ranking-time duplicate filter.

    Google has actually patented two "similar page" detection methods: one that it runs at ranking time, based on the similarity between SERP listings (page titles and snippets), and one that compares pages for similarity both section by section and as whole pages. The second filter, I suspect, would be run over the index, with the results precomputed and stored as "fingerprints".

    In short, I suspect that this message and the omission of some pages in the SERPs come about as a result of the ranking-time duplicate filter, and not the filter that excludes pages based on similar page content.
     
    Mel, Dec 12, 2004 IP
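A minimal sketch of the ranking-time idea Mel describes: compare the title-plus-snippet text each result would display and drop results that are nearly identical to one already kept. The simple token-overlap measure and the 0.8 threshold are invented for illustration; the patented methods are certainly more involved than this.

    def tokens(text):
        return set(text.lower().split())

    def near_duplicate(a, b, threshold=0.8):
        # crude similarity between two displayed listings
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / max(len(ta | tb), 1) >= threshold

    def dedupe_serp(results):
        # keep a result only if its title+snippet is not nearly identical
        # to one that has already been listed
        kept = []
        for title, snippet in results:
            text = title + " " + snippet
            if not any(near_duplicate(text, t + " " + s) for t, s in kept):
                kept.append((title, snippet))
        return kept

    results = [
        ("Widget FAQ", "Everything about widgets, prices and delivery."),
        ("Widget FAQ - mirror", "Everything about widgets, prices and delivery."),
        ("Widget history", "How widgets were invented in 1902."),
    ]
    print(dedupe_serp(results))  # the near-identical second entry is dropped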