Billion page site, The test project.

Perrow Well-Known Member

#41

jocs said:

It's quick coding, so it won't be the fastest or smartest code, but it can "learn" how to write in some language by reading a text. It can be very useful. Thanks, Perrow!

You're welcome (though I'm a bit concerned that I actually helped someone produce black-hat code).

The most interesting thing about it, SEO-wise, is that if you feed it keyword-rich text, it will produce keyword-rich text (please note that I object to this form of generated content).
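To illustrate why a generator that "learns" from a text echoes the vocabulary it was fed, here is a minimal sketch. It assumes a word-level Markov chain of order 1, which is only one common way to build such a generator; the code actually discussed is not shown in this part of the thread, and the names build_chain and generate_text are made up for the example.

```php
<?php
// Hypothetical sketch: a word-level Markov chain (order 1).
// This is an assumed technique, not the code shared in this thread.

// Build a table mapping each word to the list of words seen right after it.
function build_chain(string $text): array {
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $chain = [];
    for ($i = 0; $i < count($words) - 1; $i++) {
        $chain[$words[$i]][] = $words[$i + 1];
    }
    return $chain;
}

// Walk the table from $start, picking a random follower at each step.
function generate_text(array $chain, string $start, int $length): string {
    $out = [$start];
    $current = $start;
    for ($i = 1; $i < $length; $i++) {
        if (empty($chain[$current])) {
            break; // dead end: this word has no recorded follower
        }
        $followers = $chain[$current];
        $current = $followers[mt_rand(0, count($followers) - 1)];
        $out[] = $current;
    }
    return implode(' ', $out);
}

$source = 'cheap blue widgets and cheap red widgets and cheap green widgets';
$chain = build_chain($source);
mt_srand(42); // seed the generator before producing text
echo generate_text($chain, 'cheap', 8), "\n";
```

Every word in the output comes from the source text, so the output can only be as keyword-rich as the input.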

You, and all other programmers, should also try the other challenges on Prag Dave's site, and do read up on his explanation of why you should link. The basic reasoning is that in many other areas where skill is needed, practitioners spend at least some time practicing, and that this might be useful for programmers as well. I think everybody on this forum would benefit from reading the introduction of the above-linked page. It can certainly be applied to most fields, not just programming.

Perrow, Jul 13, 2005 IP

tomecki Peon

#42

Thanks for the source code. I will have fun with it.

tomecki, Jul 13, 2005 IP

wwwbug Peon

#43

Does your site work?

wwwbug, Jul 14, 2005 IP

ukmp3 Peon

#44

Don't know where you are getting your results from, but Google reports all "0's".

Check at http://www.mcdar.net/q-check/datatool.asp

ukmp3, Jul 14, 2005 IP

jocs Peon

#45

wwwbug said:

Does your site work?

It's still under construction. We started the project a week ago. I wish it were finished, but there are many things that still need a lot of thinking, and also programming.
There are parts working, but the main part is still in development.

ukmp3 said:

Don't know where you are getting your results from, but Google reports all "0's".

Check at http://www.mcdar.net/q-check/datatool.asp

The test is on a testing URL, and there is only 1 public page. In a week or so it will have its own domain name.

It would be better if you read all the previous pages.

jocs, Jul 14, 2005 IP

sji2671 Self Made Mind

#46

Interesting to watch this. I just fought with Google over one of my new sites: it had loads of pages dropped and sat at around 100,000 indexed, but today it is back up to over 800,000, so I am aiming for 1 million shortly.

sji2671, Jul 17, 2005 IP

stuw Peon

#47

It would be interesting if a number of billion-page sites sprang up. How would that affect the way the guys at Google plan to spider the web? Or do you think they would just get ignored? Interesting. I'm wondering how many pages the 'world library' they are planning would take up...

stuw, Jul 17, 2005 IP

jocs Peon

#48

Now I'm in the hardest part of the work, deciding what the page structures will be.
For those who know something about programming, I need your opinion:

Main idea:

Get the URL, break it down, and calculate the page structure from it:

The URL will be www.domain.com/001/001/001/001
The main idea is to use the versatility of the MD5 hash function (see last lines to know what an MD5 hash is) to ask and reply:

Here is what I mean:
function ask_md5($what, $min, $max) {
    // Hash the question with MD5, then reduce the hash to an integer with CRC32.
    // crc32() already returns an integer (non-negative on 64-bit PHP).
    $question = crc32(md5($what));
    // Map the integer into the range $min .. $max-1 using the modulus operator.
    $answer = $min + ($question % ($max - $min));
    echo "If u ask \"$what\", from $min to $max, computer says \"$answer\"";
    return $answer;
}
PHP:
And you can ask anything; it will always return the same number for the same question:
ask_md5("Which number of paragraphs we will have in http://{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}", 3, 100);
PHP:
Here we have the number of paragraphs depending on the URL.
We can do the same at a deeper level:
$terms = array('noun', 'adj', 'adv', 'verb', '4n-gram', '5n-gram', '6n-gram',
               '7n-gram', '8n-gram', 'det', '2n-det', 'verb2');

ask_md5("Type of word in paragraph 2, line 1, word 33 in http://{$_SERVER['HTTP_HOST']}{$_SERVER['REQUEST_URI']}", 0, 12);
PHP:
We can just ask as many questions as we want, and then generate the page depending on the results.
I think this system will be one of the easiest ways to return the same words for the same URL, but is there any better or smarter way to do it?
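One detail of the approach above is that taking the modulus by ($max - $min) never yields $max itself. The following is a minimal sketch of a variant that returns the value and includes both bounds. The names ask_hash and pick_term are hypothetical, and reading only the first 7 hex digits of the MD5 hash is an assumption made here so the number fits in a PHP integer on any build.

```php
<?php
// Hypothetical variant of ask_md5: returns the value instead of echoing it,
// and includes both $min and $max in the possible answers.
function ask_hash(string $what, int $min, int $max): int {
    // First 7 hex digits of the MD5 hash = 28 bits, which fits in a PHP
    // integer on both 32-bit and 64-bit builds.
    $number = hexdec(substr(md5($what), 0, 7));
    return $min + ($number % ($max - $min + 1));
}

// Pick one element of a list, always the same one for the same question.
function pick_term(array $terms, string $what): string {
    return $terms[ask_hash($what, 0, count($terms) - 1)];
}

$terms = ['noun', 'adj', 'adv', 'verb', 'det'];
$url = 'http://www.domain.com/001/001/001/001';

$paragraphs = ask_hash("Number of paragraphs in $url", 3, 100);
$type = pick_term($terms, "Type of word in paragraph 2, line 1, word 33 in $url");

echo "paragraphs: $paragraphs, word type: $type\n";
```

Because the answer depends only on the question string, asking again for the same URL gives the same page without storing anything.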

Try the code, let's hear your opinion!

jocs, Jul 18, 2005 IP

prowess Guest

#49

I'd love to see the code. If you don't mind sharing maybe I can help you out.

prowess, Jul 19, 2005 IP

kalius Peon

#50

Are you trying to create semantically correct text?

I want to create a good auto-text generator too. Have you looked at any of the black hat tools for ideas?

kalius, Jul 19, 2005 IP

isaiasd2003 Guest

#51

I've done a little (actually about 3 yrs) homework, and learned that most search engines do see all or most pages from a site-clue*. There's something I know that spammers wish they knew, though, and I'm not gonna tell, EVER. I want to rank high thanks to all the hard work it took me to learn how to rank high the right way. It would all be pissed away if spammers got hold of such information; they'd just take over. I'm already experiencing some of the effects of spammer attacks that try to rule my keywords. If I got you all confused, sorry. To sum it up, I'm not a spammer, though I study their techniques, so in case a spammer tries taking my rank, I'll know how to go about outsmarting him/her without resorting to spam. =D

isaiasd2003, Jul 22, 2005 IP

isaiasd2003 Guest

#52

isaiasd2003 said:

I've done a little (actually about 3 yrs) homework, and learned that most search engines do see all or most pages from a site-clue*. There's something I know that spammers wish they knew, though, and I'm not gonna tell, EVER. I want to rank high thanks to all the hard work it took me to learn how to rank high the right way. It would all be pissed away if spammers got hold of such information; they'd just take over. I'm already experiencing some of the effects of spammer attacks that try to rule my keywords. If I got you all confused, sorry. To sum it up, I'm not a spammer, though I study their techniques, so in case a spammer tries taking my rank, I'll know how to go about outsmarting him/her without resorting to spam. =D

WOHOO! my 100th point! Time to Party!

isaiasd2003, Jul 22, 2005 IP

crazyhorse Peon

#53

isaiasd2003 said:

WOHOO! my 100th point! Time to Party!

Well, it looks as if you picked up some habits of the spammers. You're spamming the forum... Something else: isn't a forum about sharing ideas with one another? So get the story going on how you can rank well after studying search engine behaviour for three years.

crazyhorse, Jul 22, 2005 IP

kdb003 Active Member

#54

I am curious as to how you are going to get each word. Are you going to do a db query on a large table of n-grams/words for each word on each page? That would be a lot of db queries.

kdb003, Jul 24, 2005 IP

jocs Peon

#55

Sorry for not updating the thread; I've been working hard for many days. In a few days we'll have the test site version 0000.0000001, but it will work.

james@prowessamplifiers.c said:

I'd love to see the code. If you don't mind sharing maybe I can help you out.

Yes, I'll post the source code here, along with comments, explanations, and the idea behind each line of PHP. Maybe some of you can help improve it.

kalius said:

Are you trying to create semantically correct text?

I want to create a good auto-text generator too. Have you looked at any of the black hat tools for ideas?

kdb003 said:

I am curious as to how you are going to get each word. Are you going to do a db query on a large table of n-grams/words for each word on each page? That would be a lot of db queries.

I've been looking for the best way to generate the content, but the hard part is that it has to be generated from only 12 parameters (the ID of the page, in the URL /000/000/000/000), and with these parameters you should have ALL the info needed to always generate the same page, with the same links and the same paragraphs, for each URL. I think I've found the way with the function ask_md5, which gives you a numeric answer to any question, from MIN to MAX, and it's pretty fast.
Has any of you tested the ask_md5 function? Did you understand it? (Not the PHP code, the essence of the function.)
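The idea of rebuilding a whole page from the 12 URL digits can be sketched as follows. This is only an illustration under assumptions: the helper names (ask, page_id, page_outline), the paragraph and word ranges, and the choice of 5 links per page are all made up here and are not taken from the project's code.

```php
<?php
// Hypothetical sketch: rebuild the same page outline from the 12 URL digits alone.

// Deterministic number in $min..$max for a given question (assumed helper).
function ask(string $what, int $min, int $max): int {
    return $min + (hexdec(substr(md5($what), 0, 7)) % ($max - $min + 1));
}

// Extract the 12-digit page id from a path such as /001/002/003/004.
function page_id(string $path): ?string {
    if (preg_match('#^/(\d{3})/(\d{3})/(\d{3})/(\d{3})/?$#', $path, $m)) {
        return $m[1] . $m[2] . $m[3] . $m[4];
    }
    return null; // not a generated page
}

// Build the outline: word count per paragraph, plus outgoing links.
function page_outline(string $id): array {
    $outline = ['paragraphs' => [], 'links' => []];
    $count = ask("paragraphs in $id", 3, 10);        // assumed range
    for ($p = 0; $p < $count; $p++) {
        $outline['paragraphs'][] = ask("words in paragraph $p of $id", 20, 120);
    }
    for ($l = 0; $l < 5; $l++) {                     // assumed: 5 links per page
        $groups = [];
        for ($g = 0; $g < 4; $g++) {
            // Each group is 000..999, so every link is itself a valid page URL.
            $groups[] = sprintf('%03d', ask("link $l, group $g of $id", 0, 999));
        }
        $outline['links'][] = '/' . implode('/', $groups);
    }
    return $outline;
}

$id = page_id('/001/001/001/001');
print_r(page_outline($id));
```

Nothing is stored between requests: the outline, and the links to other generated pages, are recomputed from the id every time.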

jocs, Jul 26, 2005 IP

jocs Peon

#56

Hi folks,
I've finally got enough time to finish the first release of code that works. I've uploaded the source and added mod_rewrite rules to work with styles.

If you want to take a look: Here

jocs, Aug 10, 2005 IP

isaiasd2003 Guest

#57

Hasn't cnn.com already done this? lol. Hey, is there some way I can find out how many pages a site has? Some sort of tool?

isaiasd2003, Aug 20, 2005 IP

jocs Peon

#58

isaiasd2003 said:

Hasn't cnn.com already done this? lol. Hey, is there some way I can find out how many pages a site has? Some sort of tool?

There are two ways:
-A spider tool: a script that reads a page, follows all its links, then reads all the links on those pages and follows them too. If you restrict it to the same domain, you'll know how many pages the site has, but it may waste a lot of bandwidth, and it will show in the site statistics that someone has spidered the site. In the worst case, you'll get banned for mass-downloading the site.
-The site: command in search engines. The only catch is that it shows you only the indexed pages of the site, which may be fewer than or equal to the total pages.
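The spider approach from the first item can be sketched roughly like this. It is a simplified illustration, not a production crawler: count_pages is a hypothetical name, links are pulled out with a regular expression instead of an HTML parser, only absolute links on the same host are followed, and the page fetcher is passed in as a function so the sketch runs against a small in-memory site (example.test is a placeholder host).

```php
<?php
// Hypothetical sketch of the spider approach: count pages reachable on one host.
// $fetch is any function that takes a URL and returns its HTML, or null on failure.
function count_pages(string $startUrl, callable $fetch, int $limit = 1000): int {
    $host = parse_url($startUrl, PHP_URL_HOST);
    $queue = [$startUrl];
    $seen = [$startUrl => true];

    // Stop expanding once more than $limit URLs are known.
    while ($queue && count($seen) <= $limit) {
        $url = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue;
        }
        // Collect href values; a real spider would use an HTML parser instead.
        preg_match_all('/href="([^"#]+)"/i', $html, $matches);
        foreach ($matches[1] as $link) {
            // Only absolute links on the same host are followed in this sketch.
            if (parse_url($link, PHP_URL_HOST) !== $host || isset($seen[$link])) {
                continue;
            }
            $seen[$link] = true;
            $queue[] = $link;
        }
    }
    return count($seen);
}

// Tiny in-memory "site" so the sketch runs without touching the network.
$site = [
    'http://example.test/'  => '<a href="http://example.test/a">a</a> <a href="http://other.test/">x</a>',
    'http://example.test/a' => '<a href="http://example.test/b">b</a> <a href="http://example.test/">home</a>',
    'http://example.test/b' => 'no links here',
];
$fetch = function (string $url) use ($site) {
    return $site[$url] ?? null;
};

echo count_pages('http://example.test/', $fetch), "\n";
```

Against a live site, the fetcher would make real HTTP requests, which is where the bandwidth cost and the risk of being banned come from; a delay between requests and a low page limit help keep that in check.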

jocs, Aug 22, 2005 IP

tomecki Peon

#59

OK, it is working. Can you tell us how you generated the text?

tomecki, Sep 20, 2005 IP
