I'm currently programming a full billion-page site, to test many things about SE spidering: min and max keywords on a page, importance of anchor text, number of links on a page, position of elements, etc... I have made 13 huge frequency-sorted lists of common English words: verbs, adjectives, adverbs, nouns, etc. The idea is to generate the content of each page depending exclusively on the URL entered; that way we don't need to generate a billion pages by hand, PHP will do it for us. Have one PHP script that generates the full site, and start testing. Would like to hear your opinion, and whether you are interested in helping with the test project, or interested in the results. edit: my phrase generating test here
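To make it concrete, here is a rough sketch in PHP of what I mean by "content depends only on the URL". The file names and list handling are just placeholders for the example, not my actual code:

<?php
// Rough sketch: the requested URL alone seeds the random generator,
// so /000/000/000/001 always renders the exact same page.
// nouns.txt / verbs.txt stand in for the 13 frequency-sorted lists.
$nouns = file('nouns.txt', FILE_IGNORE_NEW_LINES);
$verbs = file('verbs.txt', FILE_IGNORE_NEW_LINES);

mt_srand(crc32($_SERVER['REQUEST_URI'])); // deterministic seed per URL

// Same seed -> same sequence of picks -> same content on every visit.
echo ucfirst($nouns[mt_rand(0, count($nouns) - 1)]) . ' '
   . $verbs[mt_rand(0, count($verbs) - 1)] . ".\n";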
I have built sites that basically generate an infinite loop of content, to see what happens, too. In my experience, the number of pages Google will spider seems to be limited by the link popularity of the site.
Would be nice to see your content-loop site, is it up? I'll give it 90% of my co-op links when it's done. But what I really want is to run this test with other people, because my ideas could be wrong, or maybe I'll miss something important.
That's my two cents as well. I have noticed the same thing: I'm currently stuck at a maximum of 180k pages indexed in Google. Maybe it's a PR limit as well. But I would definitely like to hear and see more about your project. Keep us updated.
I'll post the URL here when it's done. What do you think of my phrase generator? Does anyone have a better idea for generating phrases? (mine here)
It will be machine-generated content, based on English language statistical data. I mean that if any SE analyzes the content looking for anything suspicious, it will find only statistically perfect English, i.e. if the word "the" should statistically make up 4% of all text, that is what it will find. But the text will make no sense at all to a human. (see the phrase generator to get an idea) The content will depend exclusively on the URL: if you enter www.thedomain.com/000/000/000/001 you will always find the same content, the same distribution, and the same links to the same places.
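The statistical part comes down to frequency-weighted sampling. Something like this sketch (the function name and the toy frequencies are made up for the example):

<?php
// Sketch of frequency-weighted word picking: each word carries its
// corpus frequency and we sample proportionally, so "the" ends up
// at roughly its 4% share of the output.
function pickWeighted(array $words, array $freqs) {
    $r = mt_rand(1, array_sum($freqs)); // deterministic if seeded from the URL
    foreach ($words as $i => $word) {
        $r -= $freqs[$i];
        if ($r <= 0) {
            return $word;
        }
    }
    return end($words); // fallback, should not normally be reached
}

$words = array('the', 'of', 'and', 'cat');
$freqs = array(400, 250, 200, 5); // occurrences per 10,000 words, say
echo pickWeighted($words, $freqs);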
@JOCS: this is a great test. Google claims to have indexed 8 x 10^9 pages, so your site would eat up about 12.5% of Google's index... something tells me Google is not going to index all your pages. Maybe you can keep this thread alive by posting the actually spidered pages (check your logs) and the actually indexed pages (site: command), and see how they compare. Good luck. If you need any co-op weight, I'm willing to point some of mine to your site.
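For the spidered count, even something dumb like counting Googlebot lines in the access log would do. A rough PHP version (the log path and format are assumptions, adjust to your server):

<?php
// Rough way to count Googlebot fetches from the access log.
$hits = 0;
$fh = fopen('/var/log/apache2/access.log', 'r');
while (($line = fgets($fh)) !== false) {
    if (strpos($line, 'Googlebot') !== false) {
        $hits++;
    }
}
fclose($fh);
echo "Googlebot requests so far: $hits\n";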
Certainly not. I've searched the web for a nice short definition of cloaking: Cloaking: The practice of allowing users to see one version of your website, while showing robots, crawlers and spiders something else. My site will always show the same content for the same URL to anybody who looks at it. In 2 or 3 days I'll have some alpha version of the site; I'll post it as soon as I have it. @frankm: Thanks for your support. Hope to have the site up soon!
In order to do some SE testing and research you are going to create a billion pages of spam, pages useless to humans but designed as traps for spiders? I agree with R & D, but this just seems like another billion or more pages of useless crap.
Yeah, pretty much; it's how I pay my rent. Mine is sort of useful because it displays actual information that people seem to look up. Mine is up to 39,000 pages so far. I guess I will see how far it goes. I would love to use a screen scraper to get more content, but people go crazy and email death threats and such. I don't need the hassle.
Sorry for trying to understand things. I'm happy you've got 39,000 pages, but I don't have that much human-interesting content. As for "another billion or more pages of useless crap": maybe it is another billion pages of useless crap, but maybe I'll also learn something. Why are you so disgusted? Would it be better if I stole content from other sites instead of generating something with a big title saying IT'S A TEST - LEAVE NOW?
Even though you have an infinite loop of keywords, I think you would also need lots of internal links and variety in anchor text, as well as variety in page length. Also, if the pages change upon refresh, that may be an issue too. Google has all sorts of buffers in place to find patterns. Once they reach a certain point (in number of patterns), they seem to set a max limit on indexed pages.
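The variety can still be deterministic, though. Page length and link count can come from the same URL seed, something like this sketch (the numbered-page URL scheme is just borrowed from the earlier example):

<?php
// Variety without losing determinism: page length and link count are
// derived from the URL-based seed, so pages differ from one another
// but never change on refresh.
mt_srand(crc32($_SERVER['REQUEST_URI']));

$numParagraphs = mt_rand(3, 12); // varied page length
$numLinks      = mt_rand(5, 40); // varied internal link count

for ($p = 0; $p < $numParagraphs; $p++) {
    echo "<p>(generated paragraph " . ($p + 1) . " goes here)</p>\n";
}

for ($i = 0; $i < $numLinks; $i++) {
    // Hypothetical internal links reusing the numbered-URL scheme.
    $target = sprintf('/%03d/%03d/%03d/%03d',
        mt_rand(0, 999), mt_rand(0, 999), mt_rand(0, 999), mt_rand(0, 999));
    echo '<a href="' . $target . '">page ' . ($i + 1) . '</a>' . "\n";
}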
Uploaded the latest version, including layout and a newer text-generating algorithm. Here -> Billion pages - The test project