Hmmm... I am struggling to get Google to index my pages, and here they fill the net with a bunch of garbage and get indexed without problems...
The project site is still not indexed, but I have to buy a domain for it when it's finished, and then I'll put links to it. The only thing that works now is the "Computer generated text:" part, which generates new text every time you reload the page. No links work yet, because at the moment it's only a layout (all the links are href=#) and a test of the text-generating script. This week I'll finish the script that generates full page content depending on the URL. What do you think of the generated text? (I know that sometimes it generates really stupid phrases.)
It will be interesting to see when it's up. I have been working on a project to really get a billion pages indexed in Google and Yahoo.
You are big! I have no more than 12K, but until the project goes live I have only Spanish sites. But I'm sure we'll be able to reach the limit with this test; I don't think I'm missing any important points. Another thing: I'm currently documenting the project (source code, explanation, working examples) to show you The Project in a more serious way. Tomorrow I'll post the source code of the text-generating script; maybe some of you can improve it or find a bug in it. In a few days I'll have the documentation in PDF, so it will be easier to explain in a graphical way. (I tried to write at least 10 posts explaining the idea, but it's too hard to make it easy to understand without graphics and arrows and ...) P.S.: For all of you who think we'll generate 1 billion pages of spam: it's OK, we don't need to read more posts about that. Thanks for the advice. We are not trying to spam anyone or anything; we are learning, trying, understanding, proving, testing... not spamming.
Man, that's gotta take a lot of time, and it might be part of the reason why Google is forced to screw us real webmasters with lower ranks, etc. I don't believe you need to spam the engines to understand G. I never had to; all I did was hard work: understanding how to optimize my sites, getting backlinks, and reading all I can on new stuff. Whatever you find out may become useless, because G changes every now and then, and you'll be back at square one. As for me, I'd be screwed, having to learn whatever new things G requires of a site just to be accepted.
What I still don't understand is what the use of this ridiculous "test" is. Why would anyone even care whether Google is able to index a billion pages or not? If you are going to run tests, run something useful: go test any of the million Google myths out there. Filling the net with more garbage and calling it an interesting test is absurd.
I think generating random pages can be fun and interesting, but just don't link to them. If I were outputting random text, I would combine a few translators and random word generators (i.e. my text generator and my word replacer).
What kind of scripts are you using to generate text? I like the Perl module Silly::Wabby from CPAN. How about you?
I like using public-domain text and modifying it. I use Python to get the text into a form that PHP can easily insert into a DB, then I use PHP for the actual scripts (probably not the most scalable approach).
I wrote my own PHP script. The script itself is really easy; the hard work was generating the frequency-ordered word lists, and that part is solved now. I have these files, each containing elements separated by line breaks:

    noun    with 6767 lines
    adj     with 6500 lines
    adv     with 1300 lines
    verb    with 2996 lines
    det     with 124 lines
    2n-det  with 623 lines
    4n-gram with 6500 lines
    5n-gram with 1366 lines
    6n-gram with 177 lines
    7n-gram with 953 lines
    8n-gram with 410 lines

Here is the source:

    function give_word($file) {
        switch ($file) {
            case "noun":    $max = 6767; break;
            case "adj":     $max = 6500; break;
            case "adv":     $max = 1300; break;
            case "verb":    $max = 2996; break;
            case "4n-gram": $max = 6500; break;
            case "5n-gram": $max = 1366; break;
            case "6n-gram": $max = 177;  break;
            case "7n-gram": $max = 953;  break;
            case "det":     $max = 124;  break;
            case "2n-det":  $max = 623;  break;
        }
        $open = fopen("freq/" . $file . ".txt", 'rb');
        $rdm = rand(0, rand(0, rand(0, $max)));
        while ($rdm > 1) {
            fgets($open, 50); // skip ahead to the chosen line
            $rdm--;
        }
        $result = fgets($open, 50);
        fclose($open);
        return $result;
    }

The best thing is that the 6767 nouns in the noun file are the first 6767 nouns ordered by frequency of appearance in English. So the 1st noun is the most used and the 6767th is the least used. (There were thousands of nouns, but I kept only the first 6767 or so; I wanted the best response time.) Then the only thing we do is go to a line number and read the whole line, which is what the fopen/fgets part of the function above does. I don't use $open = file("...") because it's too slow; it's faster to open the file and skip to the $rdm-th line. Generating text: just put in this code and you'll have different text every time.
    echo ucfirst(give_word("verb") . give_word("4n-gram") . give_word("noun") . ".");

If you want to have fun with the full source, or improve it, get this zip file (PHP). If you want to have fun with the text files, or improve them, here is a zip file with them: freq.zip. (Make sure you create a dir named freq/ and then unzip the files into it if you want to use my code, or modify the fopen line.)

P.S.: Why rand(0, rand(0, rand(0, $max)))? Because with a single rand() we would get every line with the same probability; with three nested rand() calls we give higher probability to the low line numbers (the most frequent words) and lower probability to the high line numbers.
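To see how much the nested rand() skews the draws toward the top of a list, here is a small simulation sketch (the variable names, the trial count, and the 10% cutoff are just illustrative choices, not part of the original script):

```php
<?php
// Compare a flat rand() against the nested rand() used in give_word().
// With one rand(0, $max) every line number is equally likely; with
// rand(0, rand(0, rand(0, $max))) low line numbers (the most frequent
// words) come up far more often.
$max = 6767;
$trials = 100000;
$flat_low = 0;   // draws that land in the first 10% of the list
$nested_low = 0;
for ($i = 0; $i < $trials; $i++) {
    if (rand(0, $max) < $max / 10) $flat_low++;
    if (rand(0, rand(0, rand(0, $max))) < $max / 10) $nested_low++;
}
echo "flat:   " . ($flat_low / $trials) . "\n";   // roughly 0.10
echo "nested: " . ($nested_low / $trials) . "\n"; // well above 0.10
```

So the nested version lands in the most-frequent tenth of the file most of the time, which is exactly the bias toward common words the P.S. describes.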
I forgot to tell you that in the Xn-gram lists there are many symbols like # that mean "a number goes here"; in some files I replaced them with an actual number. If you find any combination that generates good-looking phrases (adj+noun+verb2+4n-gram+verb...), please post it here.
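Instead of editing the files, the # placeholders could also be filled in on the fly when an n-gram is read. A quick sketch (fill_numbers is just a name I made up; it is not part of the posted script):

```php
<?php
// Replace every '#' placeholder in an n-gram with a (possibly different)
// random digit, so "top # tips" can come out as e.g. "top 7 tips".
function fill_numbers($ngram) {
    return preg_replace_callback('/#/', function ($m) {
        return (string) rand(0, 9);
    }, $ngram);
}

echo fill_numbers("top # tips for # webmasters") . "\n";
```

preg_replace_callback runs the callback once per match, so each # gets its own random digit. (The anonymous function needs PHP 5.3+.)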
I do think this method is much more fun and generates quite funny texts. Also, it's one of a series of programming exercises meant to improve your programming skills.
I really don't understand people's problem with this. Fryman, are the pages likely to affect your SERPs? I sincerely hope not, or it doesn't say much for your sites. What could be learned from an experiment like this: 1) Is there a max number of pages that G will index for a given PR, or is there a max at all? 2) How many IBLs are needed to actually get things indexed? Those are two things that spring to mind. I'm sure there's actually quite a lot that could be learned about the SEs from an experiment like this.
Wow, nice challenge. Here it is:

    function parse_trigrams($long_text) {
        $trigram = array();
        $words = preg_split("/[\s,.]+/", $long_text);
        $num_words = count($words);
        for ($c = 0; $c < $num_words - 2; $c++) {
            // For each pair of consecutive words, count how often
            // each third word follows it.
            $key = $words[$c] . " " . $words[$c + 1];
            if (!isset($trigram[$key][$words[$c + 2]])) {
                $trigram[$key][$words[$c + 2]] = 0;
            }
            $trigram[$key][$words[$c + 2]]++;
        }
        return $trigram;
    }

    $text_total = "it was in the wind that was what he thought was his companion. I think would be a good one and accordingly the ship their situation improved. Slowly so slowly that it beat the band! You’d think no one was a low voice. Don’t take any of the elements and the inventors of the little Frenchman in the enclosed car or cabin completely fitted up in front of the gas in the house and wringing her hands. I’m sure they’ll fall! She looked up at them. He dug a mass of black vapor which it had refused to accept any. As for Mr. Swift as if it goes too high I’ll warn you and you can and swallow frequently. That will make the airship was shooting upward again and just before the raid wouldn’t have been instrumental in capturing the scoundrels right out of jail";

    $trigram = parse_trigrams("I wish I may I wish I might"); // try $text_total for a big one
    foreach ($trigram as $key => $value) {
        echo "<br> \"$key\" => [";
        foreach ($value as $key2 => $val) {
            echo " $key2($val), "; // show the count as well
        }
        echo "]";
    }

It's fast coding, so it won't be the fastest or smartest code, but it can "learn" how to write in a language by reading a text. That can be very useful. Thanks, Perrow! For those who don't know what this is, please read Perrow's post and follow the link there. I don't understand what you mean; right now we are not working on classifying different pages on non-English sites. Can you explain (in plain English, please) what you mean? I will reply for sure, so don't be shy! ... Thanks, jlawrence!
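Once parse_trigrams() has built the table, you can also walk it the other way and generate text from it. A minimal sketch, assuming the same array shape the function above returns (generate_from_trigrams and the tiny hand-built table are mine, not from the post):

```php
<?php
// Markov-chain generator on top of a trigram table shaped like
// parse_trigrams() output: $table["word1 word2"]["word3"] = count.
// Starts from a word pair and repeatedly picks a follower, weighted
// by how often it was seen after that pair.
function generate_from_trigrams($trigram, $w1, $w2, $length) {
    $out = array($w1, $w2);
    for ($i = 0; $i < $length; $i++) {
        $key = $w1 . " " . $w2;
        if (!isset($trigram[$key])) break; // dead end: no known follower
        // Weighted pick: each candidate's weight is its observed count.
        $total = array_sum($trigram[$key]);
        $pick = rand(1, $total);
        $next = "";
        foreach ($trigram[$key] as $word => $count) {
            $pick -= $count;
            if ($pick <= 0) { $next = $word; break; }
        }
        $out[] = $next;
        $w1 = $w2;
        $w2 = $next;
    }
    return implode(" ", $out);
}

// Tiny hand-built table in the same shape, from "I wish I may I wish I might":
$table = array(
    "I wish" => array("I" => 2),
    "wish I" => array("may" => 1, "might" => 1),
    "I may"  => array("I" => 1),
    "may I"  => array("wish" => 1),
);
echo generate_from_trigrams($table, "I", "wish", 6) . "\n";
```

With a big table built from $text_total, this walks the chain through real word pairs, so the output reads much more like English than gluing independent random words together.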