Billion page site, The test project.

Discussion in 'Search Engine Optimization' started by jocs, Jul 9, 2005.

  1. Perrow

    Perrow Well-Known Member

    #41
    You're welcome (though I'm a bit concerned that I actually helped someone produce black-hat code :eek: ).

    The most interesting thing about it, SEO-wise, is that if you feed it keyword-rich text, it will produce keyword-rich text (please note that I object to this form of generated content).

    You, and all other programmers, should also try the other challenges on Prag Dave's site, and do read up on his explanation of why you should link. The basic reasoning is that in many other areas where skill is needed practitioners spend at least some time exercising, and that this might be useful for programmers as well. I think everybody on this forum would benefit from reading the introduction of the above linked page. It can certainly be applied to most fields, not just programming.
     
    Perrow, Jul 13, 2005 IP
  2. tomecki

    tomecki Peon

    #42
    Thanks for the source code. I will have fun with it.
     
    tomecki, Jul 13, 2005 IP
  3. wwwbug

    wwwbug Peon

    #43
    Does your site work?
     
    wwwbug, Jul 14, 2005 IP
  4. ukmp3

    ukmp3 Peon

    #44
    ukmp3, Jul 14, 2005 IP
  5. jocs

    jocs Peon

    #45
    It's still under construction. We started the project a week ago. I wish it were finished, but many things still need a lot of thought, and also programming.
    Some parts are working, but the main part is still in development.

    The test runs on a testing URL, and there is only 1 public page. In a week or so it will have its own domain name.

    It would be better if you read all the previous pages.
     
    jocs, Jul 14, 2005 IP
  6. sji2671

    sji2671 Self Made Mind

    #46
    Interesting to watch this. I just fought with Google over one of my new sites: it had loads of pages dropped by Google and sat at around 100,000 indexed, but today she is back up to over 800,000, so I am aiming for 1 million shortly.
     
    sji2671, Jul 17, 2005 IP
  7. stuw

    stuw Peon

    #47
    It would be interesting if a number of billion-page sites sprang up. How would that affect the way the guys at Google plan to spider the web? Or do you think they would just get ignored? Interesting. I'm wondering how many pages the 'world library' they are planning would take up...
     
    stuw, Jul 17, 2005 IP
  8. jocs

    jocs Peon

    #48
    Now I'm in the hardest part of the work, deciding what the page structure will be.
    For those who know something about programming, I need your opinion:

    main idea:

    getting the url, breaking it down, and calculating the page structure:

    url will be www.domain.com/001/001/001/001
    The main idea is to use the versatility of the MD5 hash function (see last lines to know what an md5 hash is) to ask and reply:

    here is what I mean:
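    function ask_md5($what, $min, $max) {
        // md5() turns the question into a fixed 32-character hex digest.
        $question = md5($what);
        // crc32() reduces the digest to an integer; the mask keeps it
        // non-negative on both 32-bit and 64-bit PHP.
        $question = crc32($question) & 0x7fffffff;
        // Modulus maps it into the range: answers run from $min to $max - 1.
        $answer = $min + ($question % ($max - $min));
        echo "If u ask \"$what\", from $min to $max, computer says \"$answer\"";
        return $answer;
    }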
    
    PHP:

    And you can ask for anything, and it will always return the same number for the same question:

    ask_md5("Which number of paragraphs we will have in http://{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}",3,100);
    
    PHP:
    Here we have the number of paragraphs depending on the URL.
    We can do the same at a deeper level:

    
    $terms=array('noun','adj','adv','verb','4n-gram','5n-gram','6n-gram'
                      ,'7n-gram','8n-gram','det','2n-det','verb2');
    
    ask_md5("Type of word in paragraph 2, line 1, word 33 in http://{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}",0,12);
    
    PHP:
    We can just ask as many questions as we want, and then generate the page depending on the results.
    I think this system will be one of the easiest ways to return the same words for the same URL, but is there any better or smarter way to do it?
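
    As a rough sketch of how several such questions might be combined into one page skeleton: this assumes a variant of ask_md5 that returns its answer instead of echoing it, with $max treated as exclusive. The helper name page_skeleton, the words-per-paragraph range (5 to 40) and the example URL are made up for illustration and are not taken from the project.

```php
<?php
// Variant of ask_md5 that returns the answer (assumption: $max is exclusive).
function ask_md5($what, $min, $max) {
    // Mask to 31 bits so the value is non-negative on 32- and 64-bit PHP.
    $question = crc32(md5($what)) & 0x7fffffff;
    return $min + ($question % ($max - $min));
}

// Hypothetical helper: derive a page skeleton from nothing but the URL.
// Returns one array of word types per paragraph.
function page_skeleton($url, array $terms) {
    $paragraphs = ask_md5("Number of paragraphs in $url", 3, 100);
    $skeleton = array();
    for ($p = 1; $p <= $paragraphs; $p++) {
        // The 5..39 word range is an arbitrary choice for this sketch.
        $words = ask_md5("Number of words in paragraph $p in $url", 5, 40);
        $types = array();
        for ($w = 1; $w <= $words; $w++) {
            $index = ask_md5("Type of word in paragraph $p, word $w in $url", 0, count($terms));
            $types[] = $terms[$index];
        }
        $skeleton[] = $types;
    }
    return $skeleton;
}

$terms = array('noun','adj','adv','verb','4n-gram','5n-gram','6n-gram',
               '7n-gram','8n-gram','det','2n-det','verb2');

$first  = page_skeleton('http://www.domain.com/001/001/001/001', $terms);
$second = page_skeleton('http://www.domain.com/001/001/001/001', $terms);

// Same URL, same questions, same answers: the two skeletons are identical.
var_dump($first === $second); // bool(true)
echo count($first), " paragraphs\n";
```

    Every question string contains the URL plus the position being asked about, so nothing has to be stored between requests; the page can be rebuilt from the URL alone each time it is requested.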

    Try the code, let's hear your opinion!
     
    jocs, Jul 18, 2005 IP
  9. prowess

    prowess Guest

    #49
    I'd love to see the code. If you don't mind sharing maybe I can help you out.
     
    prowess, Jul 19, 2005 IP
  10. kalius

    kalius Peon

    #50
    Are you trying to create semantically correct text?

    I want to create a good auto-text generator too. Have you looked at any of the black hat tools for ideas?
     
    kalius, Jul 19, 2005 IP
  11. isaiasd2003

    isaiasd2003 Guest

    #51
    I've done a little homework (actually about 3 years), and learned that most search engines do see all or most pages from a site-clue*. There's something I know that spammers wish they knew, though, and I'm not gonna tell, EVER. I want to rank high thanks to all the hard work it took me to learn how to rank high the right way. All of it would be pissed away if spammers got hold of such information; they'd just take over. I'm already experiencing some of the effects of spammer attacks that try to rule my keywords. If I got you all confused, sorry. To sum it up, I'm not a spammer, though I study their techniques, so in case a spammer tries taking my rank, I'll know how to go about outsmarting him/her without resorting to spam. =D
     
    isaiasd2003, Jul 22, 2005 IP
  12. isaiasd2003

    isaiasd2003 Guest

    #52
    WOHOO! my 100th point! Time to Party!:cool: ;) :D :rolleyes:
     
    isaiasd2003, Jul 22, 2005 IP
  13. crazyhorse

    crazyhorse Peon

    #53
    Well, looks as if you took over some habits of the spammers. You're spamming the forum... Something else: isn't a forum about sharing ideas with one another? So get the story going on how you can rank well after studying search engine behaviour for three years.
     
    crazyhorse, Jul 22, 2005 IP
  14. kdb003

    kdb003 Active Member

    #54
    I am curious as to how you are going to get each word. Are you going to do a db query on a large table of n-grams/words for each word on each page? That would be a lot of db queries.
     
    kdb003, Jul 24, 2005 IP
  15. jocs

    jocs Peon

    #55
    Sorry about not updating the thread. I've been working hard for many days; in a few days we'll have the test site at version 0000.0000001, but it will work.

    Yes, I'll post the source code here, with comments and explanations of what the idea was behind each line of PHP. Maybe some of you can help improve it.


    I've been looking for the best way to generate the content, but the hard thing is that it has to be generated from only 12 parameters (the ID of the page, in the URL /000/000/000/000), and from these parameters you should have ALL the info needed to always generate the same page, with the same links and the same paragraphs, for each URL. I think I've found the way with the function ask_md5, which gives you a numeric answer to any question, from MIN to MAX, and it's pretty fast.
    Has any of you tested the ask_md5 function? Did you understand it? (not the PHP code, the essence of the function)
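
    To make the "12 parameters" concrete: four groups of three digits give 1000^4 = 10^12 possible page IDs. Below is a minimal sketch of pulling that ID out of the request path. The function name parse_page_id and the rule that any other path shape is rejected are assumptions for illustration, not the project's actual code.

```php
<?php
// Hypothetical helper: split a path like /001/002/003/004 into its four
// three-digit groups. Returns null when the path does not match.
function parse_page_id($path) {
    if (!preg_match('#^/(\d{3})/(\d{3})/(\d{3})/(\d{3})/?$#', $path, $m)) {
        return null;
    }
    $groups = array_slice($m, 1);          // the four captured groups
    return array(
        'groups' => $groups,
        // Kept as a string: 12 digits do not fit in a 32-bit integer.
        'id'     => implode('', $groups),
    );
}

$page = parse_page_id('/001/002/003/004');
echo $page['id'], "\n";                    // prints 001002003004
var_dump(parse_page_id('/1/2/3/4'));       // NULL: groups must be three digits
```

    The resulting ID string (or the full URL) can then be embedded in every question passed to ask_md5, which is what ties each answer to one specific page.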
     
    jocs, Jul 26, 2005 IP
  16. jocs

    jocs Peon

    #56
    hi folks,
    I've finally got enough time to finish the first release of code that works. I've uploaded the source and added mod rewrite rules to work with styles.

    If you want to take a look: Here
     
    jocs, Aug 10, 2005 IP
  17. isaiasd2003

    isaiasd2003 Guest

    #57
    Hasn't cnn.com already done this? lol Hey, is there some way I can find out how many pages a site has? Some sort of :cool: tool?
     
    isaiasd2003, Aug 20, 2005 IP
  18. jocs

    jocs Peon

    #58
    There are two ways:
    -A spider tool: a script that reads a page, follows all its links, then reads the links on those pages and follows them too. If you restrict it to the same domain, you'll know how many pages the site has, but it may waste a lot of bandwidth, and the site statistics will show that someone has spidered the site; in the worst case, you'll get banned for mass-downloading the site.
    -The site: command in search engines. The only catch is that it shows you only the indexed pages of the site, which may be fewer than or equal to the total.
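
    A minimal sketch of the first option, the spider. To keep it runnable without touching a real server, the fetching is passed in as a callback and the demo walks a tiny made-up site held in an array. The names extract_links and count_pages, the regex-based link extraction, and the $limit cap are all assumptions for illustration; a real crawler would need a proper HTML parser, relative-URL resolution, and polite delays between requests.

```php
<?php
// Pull root-relative, same-host link paths out of an HTML string.
// Crude regex matching, good enough for a sketch only.
function extract_links($html, $host) {
    $paths = array();
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $m);
    foreach ($m[1] as $href) {
        $parts = parse_url($href);
        if ($parts === false || empty($parts['path']) || $parts['path'][0] !== '/') {
            continue; // skip malformed, empty, or non-root-relative paths
        }
        if (isset($parts['host']) && $parts['host'] !== $host) {
            continue; // stay on the same domain
        }
        $paths[] = $parts['path'];
    }
    return $paths;
}

// Breadth-first walk. $fetch($path) returns the page HTML, or null if the
// page does not exist. $limit caps the number of requests made.
function count_pages($host, $startPath, callable $fetch, $limit = 1000) {
    $seen    = array($startPath => true);
    $queue   = array($startPath);
    $fetched = 0;
    $pages   = 0;
    while ($queue && $fetched < $limit) {
        $path = array_shift($queue);
        $html = $fetch($path);
        $fetched++;
        if ($html === null) {
            continue;
        }
        $pages++;
        foreach (extract_links($html, $host) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $pages;
}

// Made-up four-page site used instead of real HTTP requests.
$site = array(
    '/'  => '<a href="/a">A</a> <a href="/b">B</a> <a href="http://other.example/x">X</a>',
    '/a' => '<a href="/">home</a> <a href="/b">B</a>',
    '/b' => '<a href="/c">C</a>',
    '/c' => 'no links here',
);
$fetch = function ($path) use ($site) {
    return isset($site[$path]) ? $site[$path] : null;
};

echo count_pages('www.domain.com', '/', $fetch), "\n"; // prints 4
```

    The $limit parameter is there because of the bandwidth and banning risk mentioned above: on a site with generated pages the walk might otherwise never end.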
     
    jocs, Aug 22, 2005 IP
  19. tomecki

    tomecki Peon

    #59
    Ok, it is working. Can you tell us how you generated the text?
     
    tomecki, Sep 20, 2005 IP