What can be the matter?

Discussion in 'Websites' started by Owlcroft, Dec 4, 2004.

  1. #1
    I have a site, omniknow.com, that is an online encyclopedia. It is a marrying of the Wikipedia open-source encyclopedia with the dmoz open-source links directory. I have an English version, plus Spanish, French, and German versions.

    The site total page count is well over a million. Roughly 15,000 are actual pages, and the rest are php-generated, but with mod_rewrite conversion to static-form URLs.
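    (For reference, a minimal sketch of the kind of mod_rewrite rule such a setup uses; the pattern and script name below are hypothetical illustrations, not the site's actual rules:)

        # Hypothetical sketch only: serve a static-looking URL such as
        #   /pages/Topics1636.html
        # from a PHP script, e.g. /scripts/topics.php?page=1636
        RewriteEngine On
        RewriteRule ^pages/Topics([0-9]+)\.html$ /scripts/topics.php?page=$1 [L,QSA]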

    Within a fairly reasonable time after launch, Google had worked its way up to indexing 248,000 pages (that is the average of ten datacenters, each queried with the "site:" command, as of November 17th and 18th).

    Since then, the indexed page count has dropped with ever-increasing speed; I am currently down to 49,400. Worse--possibly worst--I recently discovered that the site main (index) page is no longer in Google's archive. That is terrible, because the rest of the pages are reached only by way of a link chain that starts at the top: the site main (index) page links to 100 mid-level index pages, each of which links to 100 low-level pages, each of which links to (in the English-language set) about 75 or so individual-topic pages. There is some minimal cross-linking between the actual 750,000 or so topic pages, but not a lot: if a searchbot cannot find the topmost page, it is not likely to find any of the others. That is especially critical, because I recently changed the naming scheme for the mid- and low-level index pages, so as to keep the upper and lower topic names on each constant.

    All the PR-indicator sites I can find (I do not run IE, or Windoze, or any M$ products) tell me that my site front page is presently PR 5. But not only can I not find it in the archives, I cannot even get results for a search on a phrase from it. Curiously, all three of the non-English "front pages" are still indexed (and last archived on November 22nd).

    OK, what's going on here?

    The site is, so far as I can see--and by design--scrupulously compliant with all "white hat" criteria of both search engines and the content sources. The Wikipedia list of "fork" sites lists it as a "highly compliant" fork, with a "bonus" positive remark.

    I emailed Google, explaining matters in some detail, and got what was rather obviously a form response.

    This is absolutely killing me. Site traffic tracks the number of pages indexed in Google almost exactly, and my AdSense revenue likewise. I had a nice little retirement income source going here, and suddenly the bottom falls out for no reason I can dream of.

    Does anyone have any thoughts on what might be the matter, or on what I might be able to do about it?

    As noted, the front page is at http://omniknow.com; if anyone wants to see my .htaccess file, I will gladly email it, or even post it.

    I really need some help here . . . .
     
    Owlcroft, Dec 4, 2004 IP
  2. xml

    xml Peon

    #2
    Lotsa people on here seem to think Google has turned up the duplicate content filter.

    Maybe this is the source of the site's problems? Do you have many duplicate pages?
     
    xml, Dec 5, 2004 IP
  3. zamolxes

    zamolxes Peon

    #3
    Your home page (http://omniknow.com/) is a PR5 page. If you search for "http://omniknow.com/" in Google, you realize they know it is there, but it has no cache and no title/description.
    You seem to have a duplicate content problem, for example with pages like:

    http://omniknow.com/pages/Topics1636.html and
    http://omniknow.com/pages/Topics1643.html

    They have ~43.85% similar content as a robot would see it.

    I had a similar problem some time ago with one of my sites. In my experience if you have many pages with duplicate content, Google will gradually deindex them; for some reason it seems to also deindex (or show no cache/title/description) pages that are not strictly duplicate (like your home page) - I lost my home page (it was indexed with the url only, no cache/title/descr.) on that site as well. I didn't care much as I wasn't maintaining that site any longer so I left it to slowly die!

    However I think you should really try to find a way to reduce the similarity between so many thousands of pages.
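    (For anyone who wants to measure that kind of page-to-page similarity themselves, here is a rough sketch of one way to do it -- word-shingle overlap on the visible text. The file names are placeholders for two downloaded pages, and real search engines certainly use more elaborate methods:)

        # Rough sketch: estimate how similar two pages look to a robot by
        # comparing overlapping five-word "shingles" of their visible text.
        # File names are placeholders; real engines use far more elaborate methods.
        import re

        def shingles(text, k=5):
            words = re.sub(r"<[^>]+>", " ", text).lower().split()
            return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

        def similarity(html_a, html_b):
            a, b = shingles(html_a), shingles(html_b)
            if not a or not b:
                return 0.0
            return len(a & b) / len(a | b)   # Jaccard overlap, 0.0 .. 1.0

        page_a = open("Topics1636.html").read()
        page_b = open("Topics1643.html").read()
        print("similarity: {:.1%}".format(similarity(page_a, page_b)))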
     
    zamolxes, Dec 5, 2004 IP
  4. nevetS

    nevetS Evolving Dragon

    #4
    Your main page only has two inbound links: one from this forum and another from your German site. More inbound links might help.
     
    nevetS, Dec 5, 2004 IP
  5. zamolxes

    zamolxes Peon

    #5
    That is not correct:

    Google shows 3,270 links, Yahoo 422, MSN 143, MSN Beta 1,390

    The more I look at your site the more obvious it is to me that this is a duplicate content problem.
     
    zamolxes, Dec 5, 2004 IP
  6. expat

    expat Stranger from a far land

    #6
    expat, Dec 5, 2004 IP
  7. zamolxes

    zamolxes Peon

    #7
    The fact is, it is almost impossible to build this type of site without encountering such problems. Even ODP has plenty of duplicate content or pages with very little content. I guess the PR and backlinks more than compensate, though!
     
    zamolxes, Dec 5, 2004 IP
  8. johncr

    johncr Peon

    #8
    Owlcroft, I agree with the opinions posted in this thread.
    From what I have read everywhere on the Internet, Google rates a site like yours as a link farmer or, what is worse, a "link-to-nowhere farmer".
    Remember that nowadays Google is seeking CONTENT. I have clicked some of your links at random and most of them link to blank pages or to another links page.

    Suggestions?
    To begin with, delete all links to blank Wikipedia pages and try to add content, content and content.
    Since Google likes links - but not so many links - I also suggest you stop bots from crawling some pages, or from going too deep into your site. However, remember there are several bots, like Googlebot/2.1, Googlebot-Image/1.0, Mediapartners-Google/2.1 (this is the AdSense crawler), Googlebot/Test (a JavaScript scanner?) and probably several other similar insects.

    My 2 cents.
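    (For reference, the usual mechanism for the crawl-limiting suggestion above is robots.txt; a minimal sketch follows, where the blocked path is purely an example and not a recommendation for this particular site:)

        # Hypothetical example only: keep general crawlers out of one directory
        # while leaving the AdSense crawler (Mediapartners-Google) unrestricted.
        User-agent: Mediapartners-Google
        Disallow:

        User-agent: *
        Disallow: /scripts/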
     
    johncr, Dec 5, 2004 IP
  9. Solicitors Mortgages

    Solicitors Mortgages Well-Known Member

    #9
    maybe it's cos you have pages like this...
    h**p://omniknow.com/scripts/wiki.php?term=Catnip

    looks a touch...erm...overlapped, and 2ft wide in my IE6

    yes thats right...i said IE6 :eek:

    how are you checking all of your links?
    ...i had a flick around and found quite a few that are dead
     
    Solicitors Mortgages, Dec 5, 2004 IP
  10. anthonycea

    anthonycea Banned

    #10
    He can run a link checker on the site to remove dead links. Anyone got any software to suggest to Owlcroft?
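    (For anyone who would rather script it than install a tool, a bare-bones sketch of the idea in Python -- the start URL is a placeholder, and a real run over a site this size would need rate limiting, retries and saved state:)

        # Bare-bones sketch of a dead-link checker: fetch one page, extract its
        # links, and report any that do not come back with HTTP 200.
        import urllib.error
        import urllib.request
        from html.parser import HTMLParser
        from urllib.parse import urljoin

        class LinkParser(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    href = dict(attrs).get("href")
                    if href:
                        self.links.append(href)

        def check_page(url):
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                target = urljoin(url, link)
                try:
                    status = urllib.request.urlopen(target, timeout=10).status
                except (urllib.error.URLError, OSError):
                    status = None
                if status != 200:
                    print("DEAD:", target)

        check_page("http://example.com/pages/Topics0001.html")  # placeholder URL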
     
    anthonycea, Dec 5, 2004 IP
  11. dazzlindonna

    dazzlindonna Peon

    #11
    Xenu, I believe, is the name of a good free link checker.
     
    dazzlindonna, Dec 5, 2004 IP
  12. crew

    crew Peon

    #12
    If your site does not provide any original content, why should Google want to crawl it? I googled some 7 or 8 word phrases from your pages and each shows up with 50 or more results. I assume this means there are 50 other pages out there (including Wikipedia) with this exact same page content. I've noticed this with other sites as well, about 10 days ago. I think Google realized there were probably hundreds of thousands of pages that are just exact copies of Wikipedia and they really felt like there was no reason to continue to index them.
     
    crew, Dec 5, 2004 IP
  13. Owlcroft

    Owlcroft Peon

    #13
    Well, let me make many replies all in one post.


    You seem to have a duplicate content problem for example pages like . . . . They have ~43.85% similar content as a robot would see it.

    The pages cited, like the full ten thousand pages in the Topicsxxxx.html series, contain a list of roughly 75 to 80 actual encyclopedia articles. Those lists, which are completely different on each of those ten thousand pages, are each necessarily surrounded by boilerplate "framing material" that is essentially the same from page to page--and, apparently, makes up about 44% of the page.

    What we seem to be seeing here is, in effect, Google walking into a photograph gallery and insisting that all the photos are identical because each is in a similar frame. That is insanity, or that is Google (if anyone thinks there is a difference between those two statements).


    I think you should really try to find a way to reduce the similarity between so many thousands of pages.

    That would seem to be the problem in a nutshell. But, in my mind, one of the chief merits of OmniKnow is that all of its pages, both directory and content, are quite "clean"--generous white space, and no extraneous material whatever. I could fairly easily make the index pages differ more by including large amounts of random crap--just like, for example, those hideously garish pages that Amazon serves up. But are we come to this, then, that we have to despoil our site pages to satisfy a brain-dead bot written by brain-dead fools masquerading as software designers?


    another problem may be that pages just don't have enough content. . . .

    The pages have at least as much content as Wikipedia itself; is anyone saying that Wikipedia hasn't enough content on its pages? (I say "at least" because OmniKnow is not just a Wikipedia clone--see farther below).


    Another indication is that ADS is not reacting to this page but shows generic, non-specific ads.

    AdSense will always show generic ads if its bot has not yet hit the page, and its bots--which are independent of Google's searchbots--will not hit a page till someone has landed on it and called up an ad display. With that million-plus page count, very obviously many of the pages (especially as the site is relatively new) will never have been hit by a real visitor who will see the JS ad displays. So that is not germane.


    The fact is, it is almost impossible to build this type of site without encountering such problems. Even ODP has plenty of duplicate content or pages with very little content.

    Exactly. Any site that is mainly or wholly a resource lookup will, of the sheerest necessity, have constant, boilerplate "framing material". That is so of Wikipedia, the ODP, and zillions more. Many of them, I notice, have--possibly to circumvent this very idiocy on Google's part--also, then, a ton of largely irrelevant garbage, typically in left and right columns, which garbage is, usually, just a highly annoying distraction from the real content of the page.


    Remember that nowadays Google is seeking CONTENT. I have clicked some of your links at random and most of them link to blank pages or to another links page.

    Are you following an internal link from a site page, or one of the many, many outdated wrong links that Google is not updating? In any event, the pages, and so the links to them, are neither more nor less "content-free" than the Wikipedia itself, and I don't believe anyone is suggesting that Wikipedia is a useless, low-content resource.


    To begin with, delete all links to blank Wikipedia pages and try to add content, content and content.

    I am unclear on how one would delete links to "blank" Wikipedia pages, as I do not know of any such (unless you mean stubs, which I think it important to retain for exactly the reasons Wikipedia has them), nor do I see how one would add content to an existing encyclopedia of nearly a million articles.

    Since Google likes links - but not so many links - I also suggest you stop bots from crawling some pages, or from going too deep into your site.

    Hmmmm? The whole point is to get the site fully indexed by Google. I have been at some pains, and that is not a term used lightly here, to scrupulously follow Google's own oft-repeated suggestion of not having over 100 links on any one page. The top index page, which is the site front page, links to 100 mid-level index pages, each of which links to 100 low-level index pages, each of which links to 75 to 80 actual topic pages. There is no way to index 750,000 to 800,000 pages on two levels without having close to a thousand links per index page, so three levels are mandatory.
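    (The arithmetic behind that, sketched out: with only two levels, each index page would need roughly the square root of the topic-page count in links; with three levels of at most 100 links each, the whole set fits under the guideline.)

        # Back-of-the-envelope check: two levels of indexing vs. three.
        import math

        topic_pages = 800_000
        # Two levels (front page -> index pages -> topics) needs ~sqrt(N) links per page.
        print(math.ceil(math.sqrt(topic_pages)))   # 895 -- far over the 100-link guideline
        # Three levels at no more than 100 links per index page:
        print(100 * 100 * 80)                      # 800000 topic pages reachable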


    maybe it's cos you have pages like this...
    h**p://omniknow.com/scripts/wiki.php?term=Catnip
    looks a touch...erm...overlapped, and 2ft wide in my IE6


    That one is bizarre; I have it up for analysis, to see what is going wrong in the translation. Understand that the project is only two or three months old. It seems that every day presents a new page quirk needing custom handling (probably inescapable when the source material is wiki-made). Most issues I catch by monitoring the 404 log, but sometimes something like that one will slip through. It'll be fixed.

    That said, I don't think that that one odd peculiarity--or any few peculiarities in a million-plus pages--is the issue (especially since it seems highly unlikely that a googlebot is going to "see" that the screen is stretched horizontally by the display).


    how are you checking all of your links?
    ...i had a flick around and found quite a few that are dead


    Google (and other engine) links may well be dead, because they are not keeping up; but no internal link should ever come up "dead"--if you find on-page links that give a 404 message, please email me with any such.


    He can run a link checker . . . .

    Not on a million-plus links that can change nightly.


    If your site does not provide any original content, why should Google want to crawl it? I googled some 7 or 8 word phrases from your pages and each shows up with 50 or more results. I assume this means there are 50 other pages out there (including Wikipedia) with this exact same page content. I've noticed this with other sites as well, about 10 days ago. I think Google realized there were probably hundreds of thousands of pages that are just exact copies of Wikipedia and they really felt like there was no reason to continue to index them.

    I think there are several things badly wrong with that analysis. First, I do not think that in any sense Google "wants" to crawl or not crawl a site; if the site does not trip any of their sheerly mechanical no-no alerts, it gets crawled. Google's business is to index the web, not to pass value judgements on content (except as to placement in SERPs). Are you implying that if any term whatever you submit to Google shows more than, say, 30 hits, all the sites below 30 are valueless garbage that ought to be taken out and shot? We'd end up with at most a few hundred sites making up the entire web. Google does not "realize" what pages are like which. Take a look at this list of Wikipedia "mirror/fork" sites:

    Do you reckon that Google does not index any of those save Wikipedia itself?

    (You will notice that there, and at--

    --OmniKnow is listed as "highly compliant" with the Wikipedia license terms, and is even annotated "(bonus!) Links to Wikimedia fundraising page from each article.")

    But what I think sticks sharpest in my craw is that no one seems to be noticing that the very raison d'etre of OmniKnow is that it is not--let me emphasize that, IS NOT--simply another Wikipedia clone. Its "value added" is that on every topic page it also provides, besides the Wikipedia article content, the results of a dmoz search on the topic term. That may seem a small and simple idea, but so are the safety pin and the paper clip, and people seem to find them useful. I would not have gone to the very great time and effort that OmniKnow represents just to present another clone of Wikipedia. In OmniKnow, the visitor--say perhaps a schoolchild doing research for homework or a paper--gets both the topic lookup as in Wikipedia and the full ODP results for sites relevant to that term.

    Don't assume that, because everyone here knows who and what both Wikipedia and dmoz are, the public at large knows either one. Despite the many explicit and obvious references and links to Wikipedia on OmniKnow, I daily get emails from persons who want the encyclopedia changed or augmented in various ways, thinking that it is entirely my creation. So I think that the marriage effected on every OmniKnow page is indeed "value added"--I do not claim omniscience, but I certainly do not know of any other site that provides that particular service.

    So, in sum, I profoundly doubt Google decided "there was no reason to continue to index them."

    To wrap this lengthy post: the best guess would seem to be that Google is so brain-dead that it cannot distinguish list/index pages from "duplicate-content" pages, and so is reckoning that the ten thousand index pages (of the English-language version) that follow their links-number guideline are all essentially the same page.

    What is to do here, then? Any further suggestions?
     
    Owlcroft, Dec 5, 2004 IP
  14. T0PS3O

    T0PS3O Feel Good PLC

    #14
    Not sure this might be part of the problem but wanted to mention it anyway...

    A page like http://omniknow.com/scripts/wiki.php?term=Web_design keeps loading and loading, and after minutes is still adding links to the bottom. There's over 2,000 now and still counting. Not very SE friendly and not too interesting for visitors, I'd think. Perhaps limit the number of ODP links.
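    (A minimal sketch of that capping idea, in Python for illustration only -- the data layout and names here are hypothetical, and the site itself is PHP:)

        # Illustrative only: render at most a fixed number of directory results,
        # noting how many were left out, instead of an unbounded list.
        MAX_ODP_LINKS = 200

        def render_odp_section(results):
            shown = results[:MAX_ODP_LINKS]
            items = "".join(f'<li><a href="{r["url"]}">{r["title"]}</a></li>' for r in shown)
            if len(results) > MAX_ODP_LINKS:
                items += f"<li>({len(results) - MAX_ODP_LINKS} further results omitted)</li>"
            return f"<ul>{items}</ul>"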
     
    T0PS3O, Dec 6, 2004 IP
  15. disgust

    disgust Guest

    #15
    two things I'd try:

    1) I don't think the problem is that google sees YOUR pages, from one page to the next, as duplicate content "of itself." I think it sees it as duplicate content of wikipedia- or of other sites using wiki-data. to get around it? try to add even more variation from the original wiki-data.

    2) get more inbound links. get tons of them. point them towards the index, towards the main "directory" pages, and maybe even some actual articles.

    maybe sign up for the ad network and direct the ads at the pages mentioned above?
     
    disgust, Dec 6, 2004 IP
  16. Solicitors Mortgages

    Solicitors Mortgages Well-Known Member

    #16
    OWLCROFT,
    I certainly don't envy your task at hand. It's a huge resource and looks like hell on earth to co-ordinate.
    You might want to consider running a link checker through PARTS of your site on a 'periodic' basis, perhaps. Obviously you cannot do all of it in one go...or very often...but any directory full of dead links suffers very quickly.
    Thanks for your huge reply above, I respect someone who can take the time to post at such lengths.
    Regards
    *Gem*
     
    Solicitors Mortgages, Dec 6, 2004 IP
  17. Owlcroft

    Owlcroft Peon

    #17
    two things I'd try:

    1) I don't think the problem is that google sees YOUR pages, from one page to the next,
    as duplicate content "of itself." I think it sees it as duplicate content of wikipedia- or of other sites using wiki-data. to get around it? try to add even more variation from
    the original wiki-data.

    2) get more inbound links. get tons of them. point them towards the index, towards the
    main "directory" pages, and maybe even some actual articles.



    A page like http://omniknow.com/scripts/wiki.php?term=Web_design keeps loading and loading, and after minutes is still adding links to the bottom. There's over 2,000 now and still counting. Not very SE friendly and not too interesting for visitors, I'd think. Perhaps limit the number of ODP links.


    Those two comments make interesting bookends. Most pages of actual content (as opposed to the many thousands of sheer index pages) are significantly different from the corresponding Wikipedia pages, exactly because of the "value added" ODP search results. Granted, of necessity some pages will have few--or, sometimes, no--ODP hits, depending on how arcane the topic term is, but on the average page there's a good deal of non-duplicative "extra" content as compared to Wikipedia's own pages.

    It is also true, at the other extreme, as noted, that some very popular terms will return a horde of ODP hits; but I don't see how that would be seen by any bot as objectionable. As to users, they can simply read the up-page topic text while the ODP links continue to load, or stop page loading--at least that's what I do (yes, I use my own product). The problem with limiting ODP link returns is that there's no guarantee (that I know of) that they are delivered most relevant first, as opposed to, say, a Google search.

    Also, I do have quite a few inbound links, though it is impossible to say just how many. Yahoo, though--which appears, unlike Google, to filter out internal backlinks--reports over 900, and I think they're missing quite a few.

    If there is a "duplicate content" penalty, I'd guess it's owing to the thousands of index pages (which, as I noted in an earlier post, cannot be avoided when one is indexing over a million pages), each being a simple list of 75 to 100 links "encased" in some boilerplate. It appears--and I emphasize "appears", because I'm still not 100% certain--that the ratio of boilerplate to actual page-to-page-differing content on those pages (since the content is simple lists) may be too high: someone earlier put the page-to-page similarity at about 44%.

    I just got another "response" email from Google, which--as is usual with them--is about useless:

    As previously mentioned, we strongly encourage you to review our Webmaster Guidelines at www.google.com/webmasters/guidelines.html. If you make changes to your site to comply with these guidelines, please let us know.
    I don't know about anyone else, but my own opinion is that they carry this secrecy to comically absurd extremes; whatever in the world would they compromise by referring to some particular portion of those guidelines? Sigh . . . .

    Anyway, it seems clear from that response that this is not some mechanical accident, but that they are indeed applying some sort of penalty. I suppose I will try adding to each directory page some load of needless excess baggage--say, a "sample" article chosen at random--to increase the percentage of non-boilerplate content.
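    (A small sketch of that "sample article" idea, for what it is worth: seeding the random choice by the index page's own number keeps the chosen excerpt stable between crawls rather than changing on every request. The function and argument names here are hypothetical, and the sketch is Python rather than the site's PHP:)

        # Hypothetical sketch: pick one article excerpt per index page, seeded
        # by the page number so the same page always shows the same sample.
        import random

        def sample_article(index_page_number, article_titles):
            rng = random.Random(index_page_number)   # stable per index page
            return rng.choice(article_titles)

        # e.g. sample_article(1636, ["Catnip", "Web design", "Paper clip"])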

    What a sad joke.
     
    Owlcroft, Dec 6, 2004 IP
  18. Solicitors Mortgages

    Solicitors Mortgages Well-Known Member

    #18
    <<I suppose I will try adding to each directory page some load of needless excess baggage--say, a "sample" article chosen at random--to increase the percentage of non-boilerplate content.>>

    My sentiments exactly...I have thought also about adding half a page of unique waffle to 28K pages...as if this is going to assist the user in some way...PAH!

    However, I have just thought of an answer...PM me and I will share it with you...and you can let me know your thoughts.
    GEM
     
    Solicitors Mortgages, Dec 6, 2004 IP
  19. SEbasic

    SEbasic Peon

    #19
    If I were in your position, I would almost certainly add the Amazon data too.
     
    SEbasic, Dec 6, 2004 IP
  20. crew

    crew Peon

    #20
    That's exactly what I'm saying - with the assumption that if 30 pages all have the same unique 8-word phrase on them, they are all probably the exact same complete page. Google's business is to enable people to find useful information. Indexing duplicate content does nothing to help Google's 'business'.

    For example, I googled:

    "the breaks -- luck and good fortune "

    ...from one of your pages. It returns 1-4 out of 36, with 32 being omitted for being too similar. I went through about 10 of the omitted pages and they are EXACTLY the same as Wikipedia's page. Now, explain to me how 32 EXACT copies of a page help Google's business? Should our libraries start carrying 30 copies of all of their books?

    I am saying duplicate pages (not duplicate words).

    I completely disagree with the first statement. How many PhDs work at Google? It's not that difficult a task. They most certainly can detect duplicate content - especially when they know what to start with (i.e. Wikipedia clones).

    As far as the list of mirrors...what exactly does that prove? That a lot of other people thought it was a good idea to duplicate content? How does that make any of this useful to anybody? If you go through the list, you'll notice many other sites no longer have pages indexed:

    articlehead: 1 page (out of what should be thousands)
    artpolitic: probably fewer than 100 pages cached (out of 90,000+ Google knows about)
    infovoyager.com: lots of pages now show up without title and description

    Just because Wikipedia says you are allowed to copy content, doesn't mean that A) Google has to index it or B) It is somehow useful to anybody.
     
    crew, Dec 6, 2004 IP