What constitutes similar pages?

Discussion in 'Google' started by Lucky Bastard, Dec 7, 2004.

  1. #1
    Hard to explain, but when you go here:
    http://www.google.com/search?q=site:www.digitalpoint.com&hl=en&lr=&start=930&sa=N
    down at the bottom it says:
    In order to show you the most relevant results, we have omitted some entries very similar to the 939 already displayed.
    If you like, you can repeat the search with the omitted results included.

    Does anyone have any opinions on just what G considers to be "very similar" results? What does it take for a page NOT to be considered as such?
     
    Lucky Bastard, Dec 7, 2004 IP
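For what it's worth, the "repeat the search with the omitted results included" link at the bottom of that page simply re-runs the same query with Google's filter=0 parameter appended, so you can put the filtered and unfiltered result sets side by side. A minimal sketch in Python of building both URLs, using the query from the post above:

    # Build the normal (filtered) and "omitted results included" (filter=0) URLs
    # for the same site: query, so the two result counts can be compared.
    from urllib.parse import urlencode

    base = "http://www.google.com/search"
    params = {"q": "site:www.digitalpoint.com", "hl": "en"}

    filtered_url = base + "?" + urlencode(params)                      # similar entries omitted (default)
    unfiltered_url = base + "?" + urlencode(dict(params, filter="0"))  # what the "omitted results" link does

    print(filtered_url)
    print(unfiltered_url)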
  2. DomainLoot

    DomainLoot Guest

    #2
    I think part of this might mean results from the same site, just different pages, maybe?

    Never gave it a lot of thought before now...
     
    DomainLoot, Dec 7, 2004 IP
  3. Owlcroft

    Owlcroft Peon

    #3
    The question is very, very definitely not trivial, because "similar" pages can trip G's "duplicate-content" filter, which, by popular report, has recently gotten *much* more aggressive.

    There are various tools on the web that will return a supposed measure of "percentage similarity" between any two selected pages. Does anyone have any experience-based information on roughly what percentage similarity triggers G's "duplicate" alarm? (Since the cranking up thereof in, I would say, late November?)

    As I have remarked at great length on another thread here, perfectly innocent pages, whose real content is utterly different from one to another, can--it seems--unintentionally trip the alarm if the content is relatively brief compared to some page-common boilerplate; this will be especially true, as it seems to have been in my case, of index pages, where the real content is, say, 75 to 100 links. I have checked, and see figures from 35% to as high as 60% similarity between pages that any human would say are virtually 100% different.
     
    Owlcroft, Dec 8, 2004 IP
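For readers wondering how those "percentage similarity" tools might arrive at a figure, here is a rough sketch: split each page's text into overlapping word shingles and take the Jaccard overlap of the two sets. The 4-word shingle size and the toy pages are arbitrary choices for illustration; real tools, and Google's own filter, may work quite differently.

    import re

    def shingles(text, size=4):
        # overlapping runs of `size` consecutive words, as a set
        words = re.findall(r"[a-z0-9']+", text.lower())
        return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 0))}

    def similarity(text_a, text_b, size=4):
        a, b = shingles(text_a, size), shingles(text_b, size)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)  # Jaccard index, 0.0 to 1.0

    # Two index pages whose real content is completely different but whose
    # shared boilerplate dominates the text, as in Owlcroft's example.
    boilerplate = "Acme Widgets complete site index, all categories listed below. " * 5
    page_a = boilerplate + "red widgets blue widgets green widgets"
    page_b = boilerplate + "garden tools lawn mowers hedge trimmers"
    print(f"{similarity(page_a, page_b):.0%}")  # roughly 40%, from boilerplate alone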
  4. SEbasic

    SEbasic Peon

    #4
    I believe that the old duplicate filter was working at around 80% similarity.

    That threshold seems to have come down an awful lot recently, although I can't provide any actual figures for where the dup content filter is now running.
     
    SEbasic, Dec 8, 2004 IP
  5. PR Weaver

    PR Weaver Peon

    #5
    Can you please give me some examples of queries showing the problem of similar pages being penalised, but without the site: command?

    Thanks,
    Olivier Duffez
    PR Weaver
     
    PR Weaver, Dec 9, 2004 IP
  6. Foxy

    Foxy Chief Natural Foodie

    #6
    Actually guys, this is not about duplicate content [even though the comments above are good info] - this is about how Google displays results and saves processor time.

    What it does is make an "arbitrary" decision on how many pages of one site you might like to look at!

    To check this, do the site: search.
    Note the count and the address of the last result listed before the "similar" message, e.g. "351" and "Bog Rolls" http....

    Now click on the "see the lot" link (i.e. repeat the search with the omitted results included) and scroll down to that 351st address.
    Now re-run the site: search and you will find, say, 341 pages.

    Now run it again and it will be, e.g., 371 pages.

    If you then click on results page 36 to see the 351st entry, it will truncate the listing.
    Click on page 33 and it will truncate again.

    "Similar pages" here just means more pages from the same site. :)
     
    Foxy, Dec 10, 2004 IP
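A hedged sketch of the check Foxy describes: fetch the same site: query a few times, once with the similar-results filter on and once with filter=0, and record the totals Google reports. The regex assumes the count still appears in the old "of about <b>N</b>" form in the HTML, which is only a guess and could break at any time (and automated querying is against Google's terms, so treat this purely as an illustration of the manual procedure).

    import re
    import time
    import urllib.parse
    import urllib.request

    def reported_total(query, unfiltered=False):
        # Fetch the results page and pull out the reported total, if present.
        url = ("http://www.google.com/search?q=" + urllib.parse.quote(query)
               + ("&filter=0" if unfiltered else ""))
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
        match = re.search(r"of about <b>([\d,]+)</b>", html)
        return int(match.group(1).replace(",", "")) if match else None

    for attempt in range(3):
        print(reported_total("site:www.digitalpoint.com"),
              reported_total("site:www.digitalpoint.com", unfiltered=True))
        time.sleep(5)  # be polite between requests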
  7. SEbasic

    SEbasic Peon

    #7
    SEbasic, Dec 10, 2004 IP
  8. suni

    suni Peon

    #8
    Hello, I am new to this forum, and since I was reading this thread I want to say that I agree with Foxy: similar pages means more pages from URLs already displayed.
     
    suni, Dec 10, 2004 IP
  9. SEbasic

    SEbasic Peon

    #9
    First, welcome to the forum :)

    Can you guys just clarify what you mean?
    Maybe I've just misunderstood your posts, but the link I just pasted kinda shows that isn't true...
     
    SEbasic, Dec 10, 2004 IP
  10. darksat

    darksat Guest

    #10
    I was aware that related sites were based on similar links, but I don't think I've ever heard that specific phrase before. Got any good white papers on it? It sounds like a Google term.
     
    darksat, Dec 10, 2004 IP
  11. SEbasic

    SEbasic Peon

    #11
    SEbasic, Dec 10, 2004 IP
  12. Foxy

    Foxy Chief Natural Foodie

    #12
    Welcome to the forum also.

    What was originally asked was: when you do a site: search, at the end of the listings you get something like this:

    If you click on the 'see all' link you will see the remaining pages of the 998.

    The question was whether these pages called "similar pages" are duplicate content. The answer of course is no. :)
     
    Foxy, Dec 10, 2004 IP
  13. SEbasic

    SEbasic Peon

    #13
    Gotcha - Sorry for the confusion ;)
     
    SEbasic, Dec 10, 2004 IP
  14. Foxy

    Foxy Chief Natural Foodie

    #14
    Hehehehehe
     
    Foxy, Dec 10, 2004 IP
  15. Owlcroft

    Owlcroft Peon

    #15
    Does anyone have any numerical idea of what the G "duplicate-content" filter has been cranked up to?

    Someone posted that at some past time it was perceived as being at about 80% similarity. I suspect--though I cannot be sure or close to it--that by now it is operating below 50%, perhaps in the 40% range.

    I have modified a large (10,000+) set of site-index pages, which were, by some measuring tool on the web, coming in at 40% to 60% similarity (because even with minimal surrounding boilerplate, 100 one-to-three-word links are not a large part of any page's text), so that some extra nominally relevant (and download-time-wasting, thank you Google) material is tacked on; my new figures look like 20% to 30% similarity, so we'll see if G will start indexing them again.
     
    Owlcroft, Dec 10, 2004 IP
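A back-of-the-envelope way to see why padding the pages lowers the measured similarity: if two pages share a fixed block of boilerplate and the rest of each page is unique, a simple overlap ratio falls as the unique text grows. The word counts below are invented for illustration; Owlcroft's real pages and whichever measuring tool he used will differ.

    def overlap_ratio(shared_words, unique_words_per_page):
        # shared boilerplate as a fraction of the combined text of the two pages
        total = shared_words + 2 * unique_words_per_page
        return shared_words / total

    print(f"{overlap_ratio(400, 300):.0%}")  # thin index pages: about 40%
    print(f"{overlap_ratio(400, 700):.0%}")  # after padding each page: about 22%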
  16. crew

    crew Peon

    #16
    For the past couple of weeks, I've had sites with many hundreds of pages of unique content stuck at about 200 pages of 'non-similar' content. I started these pages at the same time, and 'marketed' them PR-wise in similar ways. It's baffling to me that 3 separate sites with unique content all became stuck within 10 pages of 200 total 'non-similar' pages. Last week, Googlebot hit about 1,000 pages of one site and the count doubled (conveniently) to approx. 400 'non-sim' pages.

    I'm starting to think that Google might have a limit based on PR, time since initial index, and maybe some other factors (for example, I don't think it is too difficult to structurally or semantically identify a blog or a directory) to determine the total number of 'non-similar' pages. I don't think it has anything to do with 'similarity' as we commonly define it. 100, 200, 400... maybe it's just a coincidence, but it seems like a good way to make sure that content is legit before getting permanently indexed.

    My plan to improve this is to increase my PR. I know PR isn't really important for search results, but I could see G still using it to determine how deep or thorough a crawl of a site is.

    Anyway, just some thoughts. Nothing concrete to back it up, but it feels like a pattern to me.
     
    crew, Dec 11, 2004 IP
  17. Owlcroft

    Owlcroft Peon

    #17
    "I'm starting to think that Google might have a limit based on PR, time since initial index, and maybe some other factors (for example, I don't think it is too difficult to structurally or semantically identify a blog or a directory) to determine the total number of 'non-similar' pages. I don't think it has anything to do with 'similarity' as we commonly define it."

    --------------

    And I'm starting to think that Google has just flat-out gone off the rails. Somebody, somewhere within Google had a wet dream, and was highly enough placed to get it implemented. The results are insane and catastrophic, but when has that ever bothered Google?

    "We don't care--we don't have to."

    Sigh.
     
    Owlcroft, Dec 12, 2004 IP
  18. Mel

    Mel Peon

    #18
    When Google says this at the end of a search ("In order to show you the most relevant results, we have omitted some entries very similar to the ones already displayed"):

    I do not think they are actually comparing whole pages for similarity; they have elected to shorten the results in order to provide more relevant ones, and I suspect (though cannot prove) that this is probably the work of the ranking-time duplicate filter.

    Google has actually patented two "similar page" detection methods: one that it runs at ranking time, based on the similarity between SERP listings (page titles and snippets), and one that compares pages for similarity both section by section and as whole pages. The second filter, I suspect, would be run over the index, with the results precomputed and stored as "fingerprints".

    In short, I suspect that this message and the omission of some pages in the SERPs come about as a result of the ranking-time duplicate filter, and not the filter that excludes pages based on similar page content.
     
    Mel, Dec 12, 2004 IP
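A minimal sketch of the ranking-time idea Mel describes: compare the title-plus-snippet text each result would display and drop results that are nearly identical to one already kept. The simple token-overlap measure and the 0.8 threshold are invented for illustration; the patented methods are certainly more involved than this.

    def tokens(text):
        return set(text.lower().split())

    def near_duplicate(a, b, threshold=0.8):
        # crude similarity between two displayed listings
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / max(len(ta | tb), 1) >= threshold

    def dedupe_serp(results):
        # keep a result only if its title+snippet is not nearly identical
        # to one that has already been listed
        kept = []
        for title, snippet in results:
            text = title + " " + snippet
            if not any(near_duplicate(text, t + " " + s) for t, s in kept):
                kept.append((title, snippet))
        return kept

    results = [
        ("Widget FAQ", "Everything about widgets, prices and delivery."),
        ("Widget FAQ - mirror", "Everything about widgets, prices and delivery."),
        ("Widget history", "How widgets were invented in 1902."),
    ]
    print(dedupe_serp(results))  # the near-identical second entry is dropped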