Billion page site, The test project.

Discussion in 'Search Engine Optimization' started by jocs, Jul 9, 2005.

  1. fryman

    fryman Kiss my rep

    #21
    Hmmm... I am struggling to get Google to index my pages, and here they fill the net with a bunch of garbage and get indexed without problems...
     
    fryman, Jul 11, 2005 IP
  2. Dominic

    Dominic Well-Known Member

    #22
    Interesting, keep updating us hey.
     
    Dominic, Jul 11, 2005 IP
  3. jocs

    jocs Peon

    #23
    The project site is still not indexed, but I have to buy a domain for it when it's finished, and then I'll put links to it.

    The only thing that works now is the "Computer generated text:" part, which generates new text every time you reload the page. No links work yet, because at the moment it's only a layout (all the links are href=#) and a test of the text-generating script.

    This week I'll finish the script that generates full page content depending on the URL.

    What do you think of the generated text? (I know that sometimes it generates really stupid phrases.)
     
    jocs, Jul 11, 2005 IP
  4. honey

    honey Prominent Member

    #24
    It will be interesting to see when it's up. I have been working on a real project with a billion pages indexed in Google and Yahoo :)
     
    honey, Jul 11, 2005 IP
  5. frankm

    frankm Active Member

    #25
    I've got a couple of sites with 500k indexed pages. That seems to be the limit for me :)
     
    frankm, Jul 12, 2005 IP
  6. ferret77

    ferret77 Heretic

    #26
    The most I have ever had was like 200k
     
    ferret77, Jul 12, 2005 IP
  7. jocs

    jocs Peon

    #27
    You are big! I have no more than 12K, but until this project gets going I only have Spanish sites. Still, I'm sure we'll be able to reach the limit with this test. I don't think I'm missing any important points.

    Another thing: I'm currently documenting the project (source code, explanation, working examples) to show you The Project in a more serious way.
    Tomorrow I'll post the source code of the text-generating script; maybe some of you can improve it or find a bug in it. In a few days I'll have the documentation in PDF, so it will be easier to explain, in a graphical way. (I tried to write at least 10 posts explaining the idea, but it's too hard to make it easy to understand without graphics and arrows and ...)


    P.S.: For all of you who think that we'll generate 1 billion pages of spam: it's OK, we don't need to read more posts about that. Thanks for the advice. We aren't trying to spam anyone or anything; we are learning, trying, understanding, proving, testing... not spamming.
     
    jocs, Jul 12, 2005 IP
  8. isaiasd2003

    isaiasd2003 Guest

    #28
    Man, that's gotta take a lot of time, and it might be part of the reason why Google is forced to screw us real webmasters with lower ranks, etc. I don't believe you need to spam the engines to understand G. I never had to; all I did was hard work: understanding how to optimize my sites, getting backlinks, and reading all I can on new stuff. Whatever you find out may become useless because G changes every now and then, and you'll be back at square one. As for me, I'd be screwed, having to learn the new things G requires of a site just to be accepted.
     
    isaiasd2003, Jul 12, 2005 IP
    Cyclops likes this.
  9. fryman

    fryman Kiss my rep

    #29
    What I still don't understand is: what is the use of this ridiculous "test"? Why would anyone even care whether Google is able to index a billion pages or not? If you are going to do tests, do them for something useful: go test any of the million Google myths out there. But filling the net with more garbage and calling it an interesting test is absurd.
     
    fryman, Jul 12, 2005 IP
  10. kdb003

    kdb003 Active Member

    #30
    I think generating random pages can be fun and interesting, but just don't link to them.
    If I were outputting some random text I would combine a few translators and random word generators (i.e. my text generator and my word replacer).
     
    kdb003, Jul 12, 2005 IP
  11. tomecki

    tomecki Peon

    #31
    What kind of scripts are you using for generating text? I like the Perl module "Silly::Werder" from CPAN. And you?
     
    tomecki, Jul 12, 2005 IP
    kdb003 likes this.
  12. kdb003

    kdb003 Active Member

    #32
    I like using public domain text and modifying it.
    I use Python to get the text into a form that PHP can easily insert into a DB,
    then I use PHP for the actual scripts (probably not the most scalable approach).
     
    kdb003, Jul 12, 2005 IP
  13. jocs

    jocs Peon

    #33
    I wrote my own PHP script. The script is really easy. The hard work was generating the frequency-ordered word lists, and that is now solved.
    I have these files, containing elements separated by line breaks:

    noun with 6767 lines
    adj with 6500 lines
    adv with 1300 lines
    verb with 2996 lines
    det with 124 lines
    2n-det with 623 lines
    4n-gram with 6500 lines
    5n-gram with 1366 lines
    6n-gram with 177 lines
    7n-gram with 953 lines
    8n-gram with 410 lines

    Here is the source:
    PHP:
    // Return one line from freq/<file>.txt, favouring low line numbers.
    function give_word($file){

        // Number of lines in each word list.
        switch($file) {
            case "noun": $max=6767; break;
            case "adj": $max=6500; break;
            case "adv": $max=1300; break;
            case "verb": $max=2996; break;
            case "4n-gram": $max=6500; break;
            case "5n-gram": $max=1366; break;
            case "6n-gram": $max=177; break;
            case "7n-gram": $max=953; break;
            case "8n-gram": $max=410; break;
            case "det": $max=124; break;
            case "2n-det": $max=623; break;
            default: return ""; // unknown list name
        }

        $open = fopen( "freq/".$file.".txt", 'rb' );
        $rdm=rand(0,rand(0,rand(0,$max)));
        // Skip lines until we reach line number $rdm (0 and 1 both give the first line).
        while($rdm>1) {
            fgets($open, 50);
            $rdm--;
        }
        $result=fgets($open, 50); // the trailing line break is kept
        fclose($open);
        return $result;
    }

    The best thing is that the 6767 nouns in the noun file are the first 6767 nouns of English ordered by frequency of appearance. So the 1st noun is the most used and the 6767th is the least used. (There were thousands of nouns, but I took only the first 6700 because I wanted the best response time.)

    Then the only thing we do is go to a line number and get all the text on that line:

    PHP:
    $open = fopen( "freq/".$file.".txt", 'rb' );
    $rdm=rand(0,rand(0,rand(0,$max)));
    while($rdm>1) {
        fgets($open, 50);
        $rdm--;
    }
    $result=fgets($open, 50);
    fclose($open);
    return $result;

    I don't use $open=File("...") because it's too slow; it's better to open the file and go to the $rdm line.


    Generating text: just use this code and you'll get different text every time.
    PHP:
    echo ucfirst(give_word("verb").give_word("4n-gram").give_word("noun").".");

    If you wanna have fun with the full source, or improve it, get this zip file (PHP).
    If you wanna have fun with the text files, or improve them, here is a zip file with them: freq.zip (make a dir named freq/ and unzip the files into it if you want to use my code, or modify the fopen line).


    P.S.: Why rand(0,rand(0,rand(0,$max)))? Because with only one rand every line has the same probability. With three nested rands, low line numbers get a higher probability and high line numbers a lower one, so the most frequent words (at the top of each file) come up more often.
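
    The effect of nesting rand() described above can be checked with a quick simulation. This is only a sketch: skewed_index() and quarter_counts() are hypothetical helper names that are not part of the posted script, and mt_rand() with a fixed seed is used here just to make runs repeatable.

```php
<?php
// Hypothetical helper: the same nested-random trick as in give_word().
function skewed_index(int $max): int {
    return mt_rand(0, mt_rand(0, mt_rand(0, $max)));
}

// Count how many draws fall into each quarter of the range 0..$max.
function quarter_counts(callable $pick, int $max, int $draws): array {
    $counts = [0, 0, 0, 0];
    for ($i = 0; $i < $draws; $i++) {
        // Map an index in 0..$max to a quarter number 0..3.
        $q = intdiv($pick($max) * 4, $max + 1);
        $counts[$q]++;
    }
    return $counts;
}

mt_srand(42); // fixed seed (an arbitrary choice) so runs are repeatable
$max = 6767;  // size of the noun list
$uniform = quarter_counts(function ($m) { return mt_rand(0, $m); }, $max, 100000);
$nested  = quarter_counts('skewed_index', $max, 100000);

echo "single rand, draws per quarter: ", implode(" ", $uniform), "\n";
echo "nested rand, draws per quarter: ", implode(" ", $nested), "\n";
```

    With the nested calls most draws should land in the first quarter of the range, while the single-rand counts stay roughly even across all four quarters.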
     
    jocs, Jul 13, 2005 IP
    GTech likes this.
  14. yo-yo

    yo-yo Well-Known Member

    #34
    Thanks for the lists of words... I can have a lot of fun with this :D
     
    yo-yo, Jul 13, 2005 IP
  15. jocs

    jocs Peon

    #35
    I forgot to tell you that in the Xn-gram lists there are many symbols like # that stand for a number; in some files I replaced them with an actual number.

    If you find any combination that generates good-looking phrases (adj+noun+verb2+4n-gram+verb ...), please post it here.
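
    One way to try out such combinations quickly is to drive the generator from a pattern string. The sketch below is an assumption-laden illustration: pick_word() and make_phrase() are hypothetical names, and the tiny in-memory word lists are made-up sample data standing in for the freq/ files.

```php
<?php
// Hypothetical stand-in for give_word(): small in-memory lists instead of the freq/ files.
function pick_word(string $type): string {
    $lists = [
        'det'  => ['the', 'a', 'this'],
        'adj'  => ['new', 'good', 'small'],
        'noun' => ['time', 'page', 'site'],
        'verb' => ['is', 'has', 'makes'],
    ];
    if (!isset($lists[$type])) {
        throw new InvalidArgumentException("unknown word type: $type");
    }
    $words = $lists[$type];
    // Nested random index: earlier (more frequent) entries are picked more often.
    $i = mt_rand(0, mt_rand(0, count($words) - 1));
    return $words[$i];
}

// Build one phrase from a pattern such as "det+adj+noun+verb+det+noun".
function make_phrase(string $pattern): string {
    $parts = array_map('pick_word', explode('+', $pattern));
    return ucfirst(implode(' ', $parts)) . '.';
}

echo make_phrase('det+adj+noun+verb+det+noun'), "\n";
```

    Swapping the pattern string is then enough to compare different combinations side by side.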
     
    jocs, Jul 13, 2005 IP
  16. Perrow

    Perrow Well-Known Member

    #36
    I do think that this method is much more fun and generates quite funny texts. Also, it's one of a series of programming exercises meant to improve your programming skills.
     
    Perrow, Jul 13, 2005 IP
  17. jocs

    jocs Peon

    #37
    Impressive concept! I'll try it.
     
    jocs, Jul 13, 2005 IP
  18. jlawrence

    jlawrence Peon

    #38
    I really don't understand people's problems with this. Fryman, are the pages likely to affect your SERPs? I sincerely hope not, or it doesn't say much for your sites.

    What could be learnt from an experiment like this:
    1) Is there a max number of pages that G will index for a given PR, or is there a max at all?
    2) How many IBLs are needed to actually get things indexed?
    Those are 2 things that spring to mind.
    I'm sure there's actually quite a lot that could be learned about the SEs from an experiment like this.
     
    jlawrence, Jul 13, 2005 IP
  19. niche4u

    niche4u Member

    #39
    That's a great project, but how can you classify different pages in non-English sites?
     
    niche4u, Jul 13, 2005 IP
  20. jocs

    jocs Peon

    #40
    Wow, nice challenge, here it is:
    PHP:
    // Build a table: "word1 word2" => [ following word => number of times seen ]
    function parse_trigrams($long_text)
    {
        $trigram = array();
        // Split on whitespace, commas and periods; drop empty pieces.
        $words = preg_split("/[\s,.]+/", $long_text, -1, PREG_SPLIT_NO_EMPTY);
        $num_words = count($words);

        for ($c = 0; $c < $num_words - 2; $c++)
        {
            $key = $words[$c]." ".$words[$c+1];
            $next = $words[$c+2];
            if (!isset($trigram[$key][$next])) {
                $trigram[$key][$next] = 0; // first time this follower is seen
            }
            $trigram[$key][$next]++;
        }
        return $trigram;
    }

    $text_total="it was in the wind that was what he thought was his companion. I think would be a good one and accordingly the ship their situation improved. Slowly so slowly that it beat the band! You’d think no one was a low voice. Don’t take any of the elements and the inventors of the little Frenchman in the enclosed car or cabin completely fitted up in front of the gas in the house and wringing her hands. I’m sure they’ll fall! She looked up at them. He dug a mass of black vapor which it had refused to accept any. As for Mr. Swift as if it goes too high I’ll warn you and you can and swallow frequently. That will make the airship was shooting upward again and just before the raid wouldn’t have been instrumental in capturing the scoundrels right out of jail";

    $trigram = parse_trigrams("I wish I may I wish I might"); // Try $text_total for a big one

    foreach ($trigram as $key => $value)
    {
        echo "<br> \"$key\" => [";
        foreach ($value as $key2 => $val) {
            echo " $key2($val), "; // show count also.
        }
        echo "]";
    }

    It's quick coding, so it won't be the fastest or smartest code, but it can "learn" how to write in a language by reading a text. That can be very useful. Thanks, Perrow!
    For those who don't know what this is, please read Perrow's post and follow the link there.
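
    The posted code only builds and prints the table. A possible next step, sketched here under assumptions, is to walk the table to produce new text. build_trigrams() and generate_text() are hypothetical names (a compact rebuild of the table is included so the sketch runs on its own), followers are stored as a plain list so repeated words are picked more often, and the starting key is assumed to be a two-word pair.

```php
<?php
// Compact table builder: "word1 word2" => list of words seen right after that pair.
function build_trigrams(string $text): array {
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $table = [];
    for ($c = 0; $c < count($words) - 2; $c++) {
        $table[$words[$c] . ' ' . $words[$c + 1]][] = $words[$c + 2];
    }
    return $table;
}

// Start from a two-word pair, append a random follower, then slide the pair forward.
function generate_text(array $table, string $start, int $max_words): string {
    $out = explode(' ', $start);
    while (count($out) < $max_words) {
        $key = $out[count($out) - 2] . ' ' . $out[count($out) - 1];
        if (!isset($table[$key])) {
            break; // dead end: this pair never had a follower in the source text
        }
        $followers = $table[$key];
        $out[] = $followers[mt_rand(0, count($followers) - 1)];
    }
    return implode(' ', $out);
}

$table = build_trigrams("I wish I may I wish I might");
echo generate_text($table, "I wish", 12), "\n";
```

    Generation stops early when it reaches a pair with no recorded follower, which happens here whenever "I might" comes up.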

    I don't understand what you mean. Right now we are not working on classifying different pages in non-English sites. Can you explain (in plain English, please) what you mean? I will reply for sure, so don't be shy!


    ... thanks, jlawrence!
     
    jocs, Jul 13, 2005 IP