Hard to explain, but when you go here: http://www.google.com/search?q=site:www.digitalpoint.com&hl=en&lr=&start=930&sa=N at the bottom it says: "In order to show you the most relevant results, we have omitted some entries very similar to the 939 already displayed. If you like, you can repeat the search with the omitted results included." Does anyone have any opinions on just what G considers to be "very similar" results? What does it take for a page NOT to be considered as such?
I think (maybe?) part of this means results from the same site, just different pages? Never gave it a lot of thought before now...
The question is very, very definitely not trivial, because "similar" pages can trip G's duplicate-content filter, which, by popular report, has recently gotten *much* more aggressive. There are various tools on the web that will return a supposed measure of "percentage similarity" between any two selected pages. Does anyone have any experience-based information on roughly what percentage similarity triggers G's "duplicate" alarm, since it was apparently cranked up in (I would say) late November?

As I have remarked at great length on another thread here, perfectly innocent pages, whose real content is utterly different from one to another, can, it seems, unintentionally trip the alarm if the content is relatively brief compared to some page-common boilerplate. This will be especially true, as it seems to have been in my case, of index pages, where the real content is, say, 75 to 100 links. I have checked, and see figures from 35% to as high as 60% similarity between pages that any human would say are virtually 100% different.
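For what it's worth, here is a rough sketch of what I assume those "percentage similarity" tools are doing under the hood: chop each page into word shingles and take the overlap. This is purely a guess at the mechanics, not anything Google has published; the shingle size and the example pages are only placeholders.

import re

def shingles(text, size=5):
    # split a page's text into lowercase word n-grams ("shingles")
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def percent_similar(page_a, page_b, size=5):
    # Jaccard overlap of the two shingle sets, as a percentage
    a, b = shingles(page_a, size), shingles(page_b, size)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)

# two made-up pages sharing boilerplate but with totally different "real" content
boiler = "copyright nav about contact sitemap search login help terms privacy"
page_a = boiler + " red widgets and blue gadgets"
page_b = boiler + " travel deals to sunny spain"
print(percent_similar(page_a, page_b, size=3))  # about 44% "similar"

On index pages like mine, where the shared boilerplate dwarfs the link text, a measure like this climbs quickly even though a human would call the pages completely different.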
I believe that the old duplicate filter was working at around 80%. This seems to have gone down an awful lot recently, although I can't provide any actual figures to help work out where the dup content filter is now set.
Can you please give me some examples of queries showing the problem of similar pages being penalised, but without the site: command? Thanks, Olivier Duffez PR Weaver
Actually guys this is not about duplicate content (even though the comments above are good info) - this is about Google displaying content and processor time. What it does is make an "arbitrary" decision on how many pages of one site you might like to look at! To check this:

1. Do the site: search.
2. Note the number and the address of the last result listed before the "similar" message, e.g. "351" and "Bog Rolls" http....
3. Click on the "see the lot" link and scroll to that 351st address.
4. Re-run the site:.... search and you will find, say, 341 pages. Run it again and it will be, e.g., 371 pages.
5. If you then click on results page 36 to see the 351st entry, it will truncate the find. Click on page 33 and it will truncate again.

Similar pages just means more of the same of that site.
Not true, strictly speaking... A lot of it comes from who is actually linking to you, and the relationships Google perceives between the sites. This is all to do with LSI (Latent Semantic Indexing) IMO. http://www.google.co.uk/search?hl=en&lr=&c2coff=1&safe=off&q=related:www.seo-dev.co.uk/ Forgive me if I got the wrong end of the stick with your post Foxy.
Hello, I am new to this forum, and since I was reading this thread, I want to say that I agree with foxy: "similar pages" means pages from URLs "already displayed".
First, welcome to the forum. Can you guys just clarify what you mean? Maybe I've just misunderstood your posts, but the link I just pasted kinda shows that isn't true...
I was aware that related sites were based on similar links, but I don't think I've ever heard that specific phrase before. Got any good white papers on it? It sounds like a Google term.
I have lots of links. This is a nice little tool: http://www.semantic-knowledge.com/ Lots of papers here: http://www.cs.utk.edu/~lsi/ I have more links, I just can't find them now... This may give you a little bit of background on why I say this is down to LSI: http://searchenginewatch.com/searchday/article.php/2196001
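If it helps, here is a toy, textbook-style illustration of the LSI idea: build a term-document matrix, reduce it with SVD, and compare documents in the reduced "concept" space. The example documents and the choice of two concepts are made up purely for illustration; this says nothing about how Google actually applies it.

import numpy as np

docs = [
    "cheap flights to spain",
    "budget airline tickets spain",
    "perl regular expression tutorial",
]
vocab = sorted({w for d in docs for w in d.split()})
# term-document matrix: rows = terms, columns = documents (raw counts)
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # keep the top-k latent "concepts"
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document in concept space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vecs[0], doc_vecs[1]))  # the two travel docs score high (close to 1.0)
print(cosine(doc_vecs[0], doc_vecs[2]))  # the unrelated doc scores low (close to 0.0)

The point is that in the reduced space, documents on the same topic end up close together even when they share few exact words, which is why I think something along these lines sits behind the related: results.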
Welcome to the forum also. What was originally asked was: when you search site: etc., at the end of the listings you get the "we have omitted some entries very similar to those already displayed" message. If you click on the "see all" link you will see the remaining pages up to the 998. The question was "were these pages, called similar pages, duplicate content?" The answer of course is no.
Does anyone have any numerical idea of what the G duplicate-content filter has been cranked up to? Someone posted that at some past time it was perceived as being at about 80% similarity. I suspect, though I cannot be sure or close to it, that by now it is operating below 50%, perhaps in the 40% range.

I have modified a large (10,000+) set of site-index pages, which were, by some measuring tool on the web, coming in at 40% to 60% similarity (because even with minimal surrounding boilerplate, 100 one-to-three-word links are not a large part of any page's text). Some extra, nominally relevant (and download-time-wasting, thank you Google) material is now tacked onto each page; my new figures look like 20% to 30% similarity, so we'll see if G will start indexing them again.
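To make the arithmetic concrete (the word counts below are hypothetical, not measurements from my actual pages, and the shared-boilerplate fraction is only a crude stand-in for what those tools report), the padding works simply by shrinking the boilerplate's share of any page-to-page comparison:

# hypothetical word counts for two index pages from the same site
boilerplate = 400          # template text common to both pages
unique_each = 300          # page-specific link text on each page

similar = boilerplate / (boilerplate + 2 * unique_each)
print(similar)             # 0.4 -> roughly the 40% figure I was seeing

padding = 500              # extra page-specific material tacked onto each page
similar = boilerplate / (boilerplate + 2 * (unique_each + padding))
print(similar)             # 0.2 -> down in the 20%-30% range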
For the past couple of weeks, I've had sites with many hundreds of pages of unique content stuck at about 200 pages of 'non-similar' content. I started these pages at the same time, and 'marketed' them PR-wise in similar ways. It's baffling to me that 3 separate sites with unique content all became stuck within 10 pages of 200 total 'non-similar' pages. Last week, Googlebot hit about 1,000 pages of one site and it doubled (conveniently) to approx. 400 'non-sim' pages.

I'm starting to think that Google might have a limit based on PR, time since initial index, and maybe some other factors (for example, I don't think it is too difficult to structurally or semantically identify a blog or a directory) to determine the total number of 'non-similar' pages. I don't think it has anything to do with 'similarity' as we commonly define it. 100, 200, 400... maybe it's just a coincidence, but it seems like a good way to make sure that content is legit before getting permanently indexed.

My plan to improve this is to increase my PR. I know PR isn't real important for search results, but I could see G still using it to determine how deep or thorough a crawl of a site is. Anyway, just some thoughts. Nothing concrete to back it up, but it feels like a pattern to me.
"I'm starting to think that Google might have a limit based on PR, time since initial index, and maybe some other factors (for example, I don't think it is too difficult to structurally or semantically identify a blog or a directory) to determine the total number of 'non-similar' pages. I don't think it has anything to do with 'similarity' as we commonly define it." -------------- And I'm starting to think that Google has just flat-out gone off the rails. Somebody, somewhere within Google had a wet dream, and was highly enough placed to get it implemented. The results are insane and catastrophic, but when has that ever bothered Google? "We don't care--we don't have to." Sigh.
When Google says this at the end of a search (the "we have omitted some entries very similar to those already displayed" message), I do not think they are actually talking about comparing pages for similarities. Rather, they have elected to shorten the results in order to provide more relevant results, and I suspect (though cannot prove) that this is probably the work of the ranking-time duplicate filter.

Google has actually patented two "similar page" detection methods: one which it runs at ranking time and which is based on the similarity between SERPs listings (page title and snippets), and one which compares both individual sections of the page and the entire page for similarities. The second filter, I suspect, would be run on the index, with the results precomputed and stored as "fingerprints".

In short, I suspect that this message and the omission of some pages in the SERPs come about as the result of the ranking-time duplicate filter, and not the filter which excludes pages based on similar page content.
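To illustrate what a ranking-time filter of the first kind could look like, here is a loose sketch that compares each result's title and snippet against the listings already kept and holds back anything too close. The function names, the token-overlap measure, and the 0.8 threshold are my own placeholders; the patents describe the general idea, not these details.

import re

def tokens(text):
    # crude tokenisation of a result's title + snippet
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def too_similar(a, b, threshold=0.8):
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold

def filter_listings(results):
    """results: list of (title, snippet, url); returns (shown, omitted)."""
    shown, omitted = [], []
    for title, snippet, url in results:
        listing = title + " " + snippet
        if any(too_similar(listing, t + " " + s) for t, s, _ in shown):
            omitted.append((title, snippet, url))   # "very similar" -> held back
        else:
            shown.append((title, snippet, url))
    return shown, omitted

The omitted list would then be what you get back when you click "repeat the search with the omitted results included".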