I'm currently programming a full billion-page site, to test many things about SE spidering: min and max keywords on a page, importance of anchor text, number of links on a page, position of elements, etc... I have made 13 huge frequency-sorted lists of common English words: verbs, adjectives, adverbs, nouns, etc. The idea is to generate the content of each page depending exclusively on the URL entered; that way we don't need to generate a billion pages by hand, PHP will do it for us. Have one PHP script that generates the full site, and start testing. Would like to hear your opinion, and whether you are interested in helping with the test project, or interested in the results. edit: my phrase generating test here
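To make it concrete, here is a rough sketch in PHP of what I mean by "content depends only on the URL". The file names and list handling are just placeholders for the example, not my actual code:

<?php
// Rough sketch: the requested URL alone seeds the random generator,
// so /000/000/000/001 always renders the exact same page.
// nouns.txt / verbs.txt stand in for the 13 frequency-sorted lists.
$nouns = file('nouns.txt', FILE_IGNORE_NEW_LINES);
$verbs = file('verbs.txt', FILE_IGNORE_NEW_LINES);

mt_srand(crc32($_SERVER['REQUEST_URI'])); // deterministic seed per URL

// Same seed -> same sequence of picks -> same content on every visit.
echo ucfirst($nouns[mt_rand(0, count($nouns) - 1)]) . ' '
   . $verbs[mt_rand(0, count($verbs) - 1)] . ".\n";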
I have built sites that basically generate an infinite loop of content, to see what happens, too. In my experience, the number of pages Google will spider seems to be limited by the link popularity of the site.
Would be nice to see your content-loop site, is it up? I'll give it 90% of my co-op links when it's done. But what I really want is to run this test with other people, because my ideas could be wrong, or maybe I'll miss something important.
That's my two cents as well. I have noticed the same thing: I'm currently stuck at a maximum of 180k pages indexed in Google. Maybe it's a PR limit as well. But I would definitely like to hear and see more about your project. Keep us updated.
I'll post the URL here when it's done. What do you think of my phrase generator? Does anyone have a better idea for generating phrases? (mine here)
It will be machine-generated content, based on English language statistical data. I mean that if any SE analyzes the content looking for anything suspicious, it will find only statistically perfect English, i.e. if the word "the" should statistically make up 4% of all text, that is what it will find. But the text will make no sense at all to a human. (see the phrase generator to get an idea) The content will depend exclusively on the URL: if you enter www.thedomain.com/000/000/000/001 you will always find the same content, the same distribution, and the same links to the same places.
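The statistical part comes down to frequency-weighted sampling. Something like this sketch (the function name and the toy frequencies are made up for the example):

<?php
// Sketch of frequency-weighted word picking: each word carries its
// corpus frequency and we sample proportionally, so "the" ends up
// at roughly its 4% share of the output.
function pickWeighted(array $words, array $freqs) {
    $r = mt_rand(1, array_sum($freqs)); // deterministic if seeded from the URL
    foreach ($words as $i => $word) {
        $r -= $freqs[$i];
        if ($r <= 0) {
            return $word;
        }
    }
    return end($words); // fallback, should not normally be reached
}

$words = array('the', 'of', 'and', 'cat');
$freqs = array(400, 250, 200, 5); // occurrences per 10,000 words, say
echo pickWeighted($words, $freqs);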
@JOCS: this is a great test. Google claims to have indexed 8 x 10^9 pages, so your site would eat up about 12.5% of Google's index... something tells me Google is not going to index all your pages. Maybe you can keep this thread alive by posting the actually spidered pages (check your logs) and the actually indexed pages (site: command), and see how they compare. Good luck. If you need any co-op weight, I'm willing to point some of mine to your site.
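For the spidered count, even something dumb like counting Googlebot lines in the access log would do. A rough PHP version (the log path and format are assumptions, adjust to your server):

<?php
// Rough way to count Googlebot fetches from the access log.
$hits = 0;
$fh = fopen('/var/log/apache2/access.log', 'r');
while (($line = fgets($fh)) !== false) {
    if (strpos($line, 'Googlebot') !== false) {
        $hits++;
    }
}
fclose($fh);
echo "Googlebot requests so far: $hits\n";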
Certainly not. I've searched the web for a nice short definition of cloaking: Cloaking: The practice of allowing users to see one version of your website, while showing robots, crawlers and spiders something else. My site will always show the same content for the same URL to anybody who looks at it. In 2 or 3 days I'll have some alpha version of the site; I'll post it as soon as I have it. @frankm: Thanks for your support. Hope to have the site up soon!
In order to do some SE testing and research you are going to create a billion pages of spam, pages useless to humans but designed as traps for spiders? I agree with R & D, but this just seems like another billion or more pages of useless crap.
Yeah, pretty much; it's how I pay my rent. Mine is sort of useful because it displays actual information that people seem to look up. Mine is up to 39,000 pages so far. I guess I will see how far it goes. I would love to use a screen scraper to get more content, but people go crazy and email death threats and such. I don't need the hassle.
Sorry for trying to understand things. I'm happy you've got 39,000 pages, but I don't have that much human-interesting content. As for "another billion or more pages of useless crap": maybe it is another billion pages of useless crap, but maybe I'll also learn something. Why are you so disgusted? Would it be better if I stole content from other sites instead of generating something with a big title saying IT'S A TEST - LEAVE NOW?
Even though you have an infinite loop of keywords, I think you would also need lots of internal links and variety in anchor text, as well as variety in page length. Also, if the pages change upon refresh, that may be an issue too. Google has all sorts of buffers in place to find patterns. Once they reach a certain point (in number of patterns), they seem to set a max limit on indexed pages.
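The variety can still be deterministic, though. Page length and link count can come from the same URL seed, something like this sketch (the numbered-page URL scheme is just borrowed from the earlier example):

<?php
// Variety without losing determinism: page length and link count are
// derived from the URL-based seed, so pages differ from one another
// but never change on refresh.
mt_srand(crc32($_SERVER['REQUEST_URI']));

$numParagraphs = mt_rand(3, 12); // varied page length
$numLinks      = mt_rand(5, 40); // varied internal link count

for ($p = 0; $p < $numParagraphs; $p++) {
    echo "<p>(generated paragraph " . ($p + 1) . " goes here)</p>\n";
}

for ($i = 0; $i < $numLinks; $i++) {
    // Hypothetical internal links reusing the numbered-URL scheme.
    $target = sprintf('/%03d/%03d/%03d/%03d',
        mt_rand(0, 999), mt_rand(0, 999), mt_rand(0, 999), mt_rand(0, 999));
    echo '<a href="' . $target . '">page ' . ($i + 1) . '</a>' . "\n";
}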
Uploaded the latest version, including layout and a newer text-generating algorithm. Here -> Billion pages - The test project