Billion page site, The test project.

Discussion in 'Search Engine Optimization' started by jocs, Jul 9, 2005.

  1. #1
    I'm currently programming a full billion-page site to test many things about SE spidering: min and max keywords in a page, importance of anchor text, number of links in a page, position of the elements, etc.

    I have made 13 huge frequency-sorted lists of common English words: verbs, adjectives, adverbs, nouns, etc. The idea is to generate the content of each page depending exclusively on the URL entered. That way we don't need to generate a billion pages by hand; PHP will do it for us.

    The plan is to have one PHP script that generates the full site, and then start testing.

    I would like to hear your opinion, and also whether you are interested in helping with the test project or interested in the results.

    edit: my phrase generating test here
     
    jocs, Jul 9, 2005 IP
  2. ferret77

    ferret77 Heretic

    Messages:
    5,276
    Likes Received:
    230
    Best Answers:
    0
    Trophy Points:
    0
    #2
    I have also built sites that basically generate an infinite loop of content, to see what happens.

    In my experience the number of pages Google will spider seems to be limited by the link popularity of the site.
     
    ferret77, Jul 9, 2005 IP
  3. jocs

    jocs Peon

    Messages:
    103
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Would be nice to see your content-loop site. Is it up?

    I'll give it 90% of my co-op links when it's done. But what I really want is to run this test together with other people, because my ideas can be wrong, or maybe I'll miss something important.
     
    jocs, Jul 9, 2005 IP
  4. crazyhorse

    crazyhorse Peon

    Messages:
    1,137
    Likes Received:
    19
    Best Answers:
    0
    Trophy Points:
    0
    #4
    That's my two cents as well. I have noticed the same thing: I'm currently stuck at my maximum of 180k pages indexed in Google. Maybe it's a PR limit as well. But I would definitely like to hear and see more about your project. Keep us updated.
     
    crazyhorse, Jul 9, 2005 IP
  5. jocs

    jocs Peon

    #5
    I'll post the URL here when it's done.
    What do you think of my phrase generator?
    Does anyone have a better idea for generating phrases? (mine here)
     
    jocs, Jul 9, 2005 IP
  6. crazyhorse

    crazyhorse Peon

    #6
    Where are you going to pull the info from? Will it be original written content?
     
    crazyhorse, Jul 9, 2005 IP
  7. jocs

    jocs Peon

    #7
    It will be machine-generated content, following statistical data about the English language. I mean that if an SE ever analyzes the content looking for anything suspicious, it will find only statistically perfect English; e.g. if the word "the" should make up 4% of all text, statistically, that is what it will find. But the text will make no sense at all to a human. (See the phrase generator to get an idea.)
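
    A minimal sketch of the frequency idea, in Python rather than the PHP the site would use. Only the 4% figure for "the" comes from this post; the other words, their frequencies, the "<other>" filler and the function name build_sampler are made-up placeholders, not the real word lists.

```python
import random

# Hypothetical frequency table: word -> share of running text.
# Only the 0.04 for "the" is taken from the post; the rest is invented.
WORD_FREQ = {
    "the": 0.04,
    "of": 0.02,
    "and": 0.02,
    "house": 0.005,
    "run": 0.005,
}

def build_sampler(freq):
    """Return a function that draws words in proportion to their frequency."""
    words = list(freq)
    weights = [freq[w] for w in words]
    # Everything not listed is lumped into one filler token so that the
    # weights sum to 1 and the listed shares stay meaningful.
    words.append("<other>")
    weights.append(1.0 - sum(weights))

    def sample(rng, n):
        return rng.choices(words, weights=weights, k=n)

    return sample

sample = build_sampler(WORD_FREQ)
rng = random.Random(12345)           # fixed seed so the run is repeatable
text = sample(rng, 100_000)
share_the = text.count("the") / len(text)
print(f'share of "the": {share_the:.3f}')
```

    With a long enough text the measured share of "the" should settle close to 4%, which is the property described above.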

    The content will depend exclusively on the URL: if you enter www.thedomain.com/000/000/000/001 you will always find the same content, the same distribution and the same links to the same places.
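
    One way this could work, sketched in Python instead of PHP: hash the URL path into a seed and drive every random choice from that seed. The word list, the function names rng_for_path and render_page, and the link-picking rule are assumptions for illustration only; the path format copies the example URL above.

```python
import hashlib
import random

# Placeholder vocabulary; the real site would use the frequency-sorted lists.
WORDS = ["time", "house", "quickly", "green", "run", "make", "small", "often"]

def rng_for_path(path: str) -> random.Random:
    """Build a random generator whose seed depends only on the URL path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def render_page(path: str, n_words: int = 30, n_links: int = 5) -> dict:
    """Generate text and outgoing links for a path, identically on every call."""
    rng = rng_for_path(path)
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = []
    for _ in range(n_links):
        n = rng.randrange(10**9)      # one of a billion page numbers
        s = f"{n:012d}"               # zero-padded, e.g. 000000000001
        links.append(f"/{s[0:3]}/{s[3:6]}/{s[6:9]}/{s[9:12]}")
    return {"path": path, "text": text, "links": links}

page_a = render_page("/000/000/000/001")
page_b = render_page("/000/000/000/001")
print(page_a == page_b)   # same URL, same page
print(page_a["links"])
```

    Nothing is stored anywhere: the page is recomputed from the path on each request, and the user agent never enters the calculation.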
     
    jocs, Jul 9, 2005 IP
  8. crazyhorse

    crazyhorse Peon

    #8
    Isn't that cloaking?
     
    crazyhorse, Jul 9, 2005 IP
  9. web-rover

    web-rover Peon

    Messages:
    1,341
    Likes Received:
    45
    Best Answers:
    0
    Trophy Points:
    0
    #9
    that's pretty interesting
     
    web-rover, Jul 9, 2005 IP
  10. frankm

    frankm Active Member

    Messages:
    915
    Likes Received:
    63
    Best Answers:
    0
    Trophy Points:
    83
    #10
    @JOCS: this is a great test. Google claims to have indexed 8 x 10^9 pages, so your site would eat up 11% of Google ... something tells me Google is not going to index all your pages.

    Maybe you can keep this thread alive by posting the number of pages actually spidered (check your logs) and actually indexed (site: command). See how they compare.
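
    A rough sketch of the log check in Python, assuming the server writes the common "combined" access log format. The sample lines, the IP addresses and the helper name spidered_paths are invented for illustration, and matching on the user-agent string alone can be fooled by anyone pretending to be Googlebot.

```python
import re

# Combined log format: ip, identity, user, [time], "request", status, size,
# "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def spidered_paths(lines, bot_token="Googlebot"):
    """Collect the distinct paths fetched successfully by the given bot."""
    seen = set()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and bot_token in m.group("agent") and m.group("status") == "200":
            seen.add(m.group("path"))
    return seen

# Made-up sample data: two bot hits on one page, one on another, one browser.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_log = [
    f'66.249.66.1 - - [10/Jul/2005:10:00:00 +0000] "GET /000/000/000/001 HTTP/1.1" 200 5120 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Jul/2005:10:00:05 +0000] "GET /000/000/000/001 HTTP/1.1" 200 5120 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Jul/2005:10:00:09 +0000] "GET /000/000/000/002 HTTP/1.1" 200 4980 "-" "{BOT}"',
    '10.0.0.7 - - [10/Jul/2005:10:01:00 +0000] "GET /000/000/000/003 HTTP/1.1" 200 5001 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

crawled = spidered_paths(sample_log)
print(len(crawled), "distinct pages spidered")
```

    The resulting count can then be set against the number reported by the site: command to see how many spidered pages actually end up indexed.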

    Good luck. If you need any co-op weight, I'm willing to point some of mine to your site.
     
    frankm, Jul 9, 2005 IP
  11. jocs

    jocs Peon

    #11
    Surely not. I've searched the web for a nice short definition of cloaking:

    Cloaking: The practice of allowing users to see one version of your website, while showing the Robots, Crawlers and Spiders something else.

    My site will always show the same content for the same URL to anybody who looks at it.

    In 2 or 3 days I'll have some alpha version of the site. I'll post it as soon as I have it.


    @frankm: Thanks for your support. I hope to have the site up soon!
     
    jocs, Jul 9, 2005 IP
  12. dct

    dct Finder of cool gadgets

    Messages:
    3,132
    Likes Received:
    328
    Best Answers:
    0
    Trophy Points:
    230
    #12
    In order to do some SE testing and research you are going to create a billion pages of spam, useless to humans but intended as traps for spiders? I agree with R & D, but this just seems like another billion or more pages of useless crap.
     
    dct, Jul 9, 2005 IP
  13. nevetS

    nevetS Evolving Dragon

    Messages:
    2,544
    Likes Received:
    211
    Best Answers:
    0
    Trophy Points:
    135
    #13
    I have to agree with you dct.
     
    nevetS, Jul 9, 2005 IP
  14. ferret77

    ferret77 Heretic

    #14
    Yeah pretty much, it's how I pay my rent.

    Mine is sort of useful because it displays actual information that people seem to look up.

    Mine's up to 39,000 pages so far. I guess I will see how far it goes.

    I would love to use a screen scraper to get more content, but people go crazy and email death threats and such. I don't need the hassle.
     
    ferret77, Jul 9, 2005 IP
  15. jocs

    jocs Peon

    #15
    Sorry for trying to understand things.

    I'm happy you've got 39,000 pages, but I don't have that much content that is interesting to humans.

    As for "another billion or more pages of useless crap": maybe it is another billion pages of useless crap, but maybe I'll also be able to learn something.

    Why are you so disgusted? Would it be better if I stole content from other sites instead of generating something with a big title saying IT'S A TEST - LEAVE NOW?
     
    jocs, Jul 9, 2005 IP
  16. dvduval

    dvduval Notable Member

    Messages:
    3,372
    Likes Received:
    356
    Best Answers:
    1
    Trophy Points:
    260
    #16
    Even though you have an infinite loop of keywords, I think you would also need lots of internal links and a variety of anchor text, as well as varied page lengths. Also, if the pages change upon refresh, that may be an issue too. Google has all sorts of buffers in place to find patterns. Once they reach a certain point (in number of patterns), they seem to set a max limit of indexed pages.
     
    dvduval, Jul 9, 2005 IP
  17. jocs

    jocs Peon

    #17
    We can give links to that point, and then it will be like another starting point, maybe.
     
    jocs, Jul 10, 2005 IP
  18. jocs

    jocs Peon

    #18
    jocs, Jul 10, 2005 IP
  19. crazyhorse

    crazyhorse Peon

    #19
    Just visited your website, but none of the links seem to be working. Am I missing something here?
     
    crazyhorse, Jul 11, 2005 IP
  20. GuyFromChicago

    GuyFromChicago Permanent Peon

    Messages:
    6,728
    Likes Received:
    529
    Best Answers:
    0
    Trophy Points:
    0
    #20
    I would assume that's the reason why some of the links are not yet working.
     
    GuyFromChicago, Jul 11, 2005 IP