Hello, does anybody know how Google tracks duplicate pages across sites? Any idea how they compare them? Do they compare the number of words on a page, or the whole text?
They look at the page structure (that includes everything: the look of the page, the content in it, etc.). The supplemental index doesn't exist any more, but it used to be one way of checking whether Google considered your page a duplicate.
It "probably" stores a page as a hash value determined by keyword placement, proximity, page size, words per page, etc. When two pages have a matching or similar hash, they are treated as duplicates.
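Nobody outside Google knows the real algorithm, but the idea of hashing a page so that similar pages get similar hashes can be sketched with a toy SimHash-style fingerprint. This is purely illustrative; the function names, the use of MD5, and the 64-bit size are all assumptions, not anything Google has confirmed:

```python
import hashlib

def simhash(text, bits=64):
    # Toy SimHash-style fingerprint: each word "votes" on every bit
    # of a fixed-size signature, so pages sharing most of their words
    # end up with mostly matching bits.
    votes = [0] * bits
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Number of differing bits; near-duplicate pages give a small distance.
    return bin(a ^ b).count("1")
```

Because it ignores word order, two pages with the same words shuffled around would get identical fingerprints, while unrelated pages would differ in many bits.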
I know there is a tool, http://www.webconfs.com/similar-page-checker.php, which checks similar pages and gives you a percentage of similarity. Any idea how that tool compares to Google's analysis?
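We can't see how that checker computes its percentage, but a common way to get a "percent similar" figure is Jaccard similarity over word shingles. A minimal sketch, assuming 3-word shingles (the shingle size and function names are my own choices, not the tool's):

```python
def shingles(text, k=3):
    # Break the text into overlapping k-word sequences (shingles).
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity_percent(a, b, k=3):
    # Jaccard similarity: shared shingles divided by total distinct shingles.
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb) * 100
```

Unlike simple word counting, this rewards matching phrases, so reordering sentences lowers the score even when the vocabulary is identical.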
A couple of months ago I tried a couple of those 'pay x and have your article on 300 different blogs' offers that you see in the Buy, Sell or Trade forum. At first, Google Webmaster Tools showed up to 300 different inbound links from the same article to the pages I wanted to point at. In the couple of months since they were published, the incoming links shown in Webmaster Tools have been whittled down to dozens, if not single figures.
That's a really simple way of determining duplicate content. You're not giving Google enough credit; I'm sure they have a complex enough algorithm to determine whether even part of an individual paragraph is a duplicate...
Duplicate documents and Supplementals have very little correlation, except that having duplicates usually results in fewer people linking to your version.
That script I showed you is not really that simple; it's not just counting words. Try playing with it and you will see that it's not that simple.
Having duplicate content was not the only reason sites ended up in the supplemental index. Weak pages with no authority or weight used to land there too.
Well, I would say it's more likely duplicate content; at least in my case I know for sure that it is duplicate content.
So, any idea how to find a tool that checks duplicate content the way Google does, or how the tool I linked at the top compares to Google? Something in the middle?
No thanks... I don't have the spare time to look at the script. Besides, Google has a more complex algorithm and a lot more processing power than any simple script ever will. I would be surprised if any such tool existed; no one knows how Google checks for dupes. I have seen some people use www.copyscape.com to check for dupes, but it is only right a low percentage of the time.
www.copyscape.com doesn't always find duplicate content. I think they do something more like counting words.
As I said, take a unique string of text from the document and Google it in quotes. Google will show all pages containing that exact string.