Hello, how much space would I need to make a copy of the internet's HTML documents? That is, how much space would I need to store one copy of each HTML (including XML and XHTML) document on the web, assuming I can crawl all (well, almost all) of it? I'm asking seriously.
It depends on how many pages you copy and how big they are. A practical approach is to start with a medium amount of storage, say 10 GB, crawl for a while, and then extrapolate from what you've collected; that will give you a much better estimate than any fixed answer.
I'm not great with math, but I'll get you started: there are currently over 20 billion pages on the web (Yahoo indexed 19 billion in August 2005), so you'll have to allow for more sites now. Apparently the average size of an HTML file is about 25 KB, though you'll also have to allow for files bigger than that. Multiply those two numbers together and add some headroom; that should get you somewhere near the right figure. I think.
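For reference, here is a quick back-of-envelope calculation in Python using the figures quoted above (20 billion pages, 25 KB average page size); both numbers are rough assumptions from this thread, not measurements:

```python
# Back-of-envelope estimate of storage for one copy of every HTML page,
# using the rough figures quoted above (assumptions, not measurements).
PAGES = 20_000_000_000          # ~20 billion pages (Yahoo indexed 19 billion in 2005)
AVG_PAGE_SIZE_KB = 25           # assumed average HTML file size in kilobytes

total_kb = PAGES * AVG_PAGE_SIZE_KB
total_tb = total_kb / 1024**3   # KB -> TB
total_pb = total_tb / 1024      # TB -> PB

print(f"Estimated raw size: {total_tb:,.0f} TB (~{total_pb:.2f} PB)")
# Prints: Estimated raw size: 466 TB (~0.45 PB) -- under a petabyte before overhead.
```

So even with generous headroom for growth and oversized pages, the raw HTML comes out in the hundreds-of-terabytes range.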
Let's talk more precisely: does 10 petabytes of space seem like enough to store all the HTML documents on the web? If so, how much do you think 10 petabytes of storage hardware would cost?
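Judging by the estimate above, 10 PB would be far more than enough for a single copy. The cost side depends entirely on the per-terabyte price you plug in, so the sketch below uses a placeholder price, not an actual quote:

```python
# Rough cost sketch for 10 PB of raw storage. PRICE_PER_TB_USD is a
# hypothetical placeholder -- substitute the current market price per TB.
CAPACITY_PB = 10
PRICE_PER_TB_USD = 100          # placeholder value, not a real quote

capacity_tb = CAPACITY_PB * 1024
print(f"{CAPACITY_PB} PB = {capacity_tb:,} TB")
print(f"Raw drive cost at ${PRICE_PER_TB_USD}/TB: ${capacity_tb * PRICE_PER_TB_USD:,}")
# Note: this is raw drive cost only; redundancy, servers, power,
# and bandwidth would add substantially on top.
```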
I have my site hosted on Servage, so I've never had to worry about hosting a large amount of data. That's why it's important to keep your future requirements in mind when choosing a hosting provider.