Anyone remember? Even though they aren't displaying it currently, it looks like (at the time of this post, anyway) they have 21,440,000,000 pages indexed: http://www.google.com/search?num=100&hl=en&lr=&safe=off&q=***+OR+***+AND+***&btnG=Search I was wondering how much that would mean the internet had grown, and over what time period, and if anyone knew how that compares to Yahoo's index? It's hard to figure Y's out; the most I can come up with is 8,780,000,000. Also, does anyone know what the "average" size of a webpage is, excluding images, and approximately how much hard drive space it would take to hold just one copy of 21,440,000,000 pages? -Michael
The average size of a webpage is around 30 KB excluding images. The search you did doesn't mean Google has 21 billion pages. It means that many matches were found. The 8 billion pages figure they used to mention was ages ago, if I remember correctly. Like in '99 or 2000. I could be wrong. Hard disk space for 20 billion pages is easy to calculate. 30 KB x 20 billion = 600 billion KB. Dividing by 1024 three times (KB to MB to GB to TB): 600 billion / 1024 / 1024 / 1024 = about 558 terabytes. So you would need about 744 (558 / 0.75), plus a few more for clustering and filesystem overhead, of the Seagate 750 GB SATA drives. For a RAID, you would need the above figure x 3.
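For anyone who wants to rerun the arithmetic above, here is a minimal Python sketch. The 30 KB average and the 750 GB drive size come from the post; the function name `drives_needed` and the `gb_per_tb` parameter are made up for illustration. The drive count depends on whether a terabyte is treated as 1000 GB or 1024 GB, which is why it does not land exactly on 744.

```python
import math

PAGES = 20_000_000_000   # rounded page count used in the post
AVG_PAGE_KB = 30         # assumed average page size, excluding images
DRIVE_GB = 750           # capacity of one drive, as quoted in the post

# Total size in KB, then KB -> MB -> GB -> TB by dividing by 1024 three times.
total_kb = PAGES * AVG_PAGE_KB
total_tb = total_kb / 1024 ** 3


def drives_needed(size_tb: float, drive_gb: int, gb_per_tb: int) -> int:
    """Whole drives needed to hold size_tb, ignoring filesystem/RAID overhead."""
    return math.ceil(size_tb * gb_per_tb / drive_gb)


drives_decimal = drives_needed(total_tb, DRIVE_GB, 1000)  # treating 1 TB as 1000 GB
drives_binary = drives_needed(total_tb, DRIVE_GB, 1024)   # treating 1 TB as 1024 GB

print(f"{total_tb:.1f} TB total")
print(f"{drives_decimal} to {drives_binary} drives before any redundancy")
```

Either way the answer is in the mid-700s for a single copy, before any redundancy.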
With the search I did, the two should be synonymous. If there is an error (and I am pretty damn sure there is at least some), then it would be in how Google calculates result counts, not in the logic of the search. -Michael
Well, it's not. Did you look at Google's help to see what the '*' character means? Plus, does it make a difference if you use 1, 2, or 3 '*'? Did you try that same query or a similar one on different engines like Yahoo or MSN?
I didn't look it up, it's a wildcard search. What is it about "match {anything} OR {anything} AND {anything}" that makes you think it is substantially different from "match everything"? I mean, if I'm missing something, fine, but I don't see what it is. Like I said, if it's off then afaik it would be because of the way G was counting the results, not due to the logic of the search. Yahoo doesn't support wildcard searches, and MSN has cooties. I found what looks to be a good indicator of the index growth, by the way. Again, not sure how accurate the dates are, because I think sometimes archive.org shows the same cache for multiple dates, but it's the closest thing I've found so far: Oct 9, 2001: Searching 1,610,476,000 web pages. Aug 3, 2002: Searching 2,073,418,204 web pages. Feb 2, 2003: Searching 3,083,324,652 web pages. Oct 6, 2003: Searching 3,307,998,701 web pages. Oct 19, 2004: Searching 4,285,199,774 web pages. Dec 7, 2004: Searching 8,058,044,651 web pages. Aug 6, 2005: Searching 8,168,684,336 web pages. -Michael Edit: If you're saying that there might be more pages than that, then I'm not disagreeing with you. I'm saying there are at least that many (if the result count is accurate), but would have no clue how to elicit the rest.
I'm saying that there are FEWER pages than what your wildcard query is showing. Going by the links you showed, according to the Wayback Machine, in Aug 2005 Google claimed to be searching around 8 billion. 21 billion is almost 3 times as much as 8 billion.
*** OR *** AND *** is now down from 21 billion to 19 billion. * OR * AND * shows 16 billion. I think if Google wanted to tell people how many pages are in their index, they would do it outright. The query, even with a wildcard, is not like a "select count(*) from google index" type of query. Anyway, I don't even care.
Are you saying that the logic behind the query should show fewer pages, or just that 21 billion is too big to be right? It is about 2.625 times the size of what it last showed, and between 2001 and 2004 it grew to over 5 times the starting size. Why do those numbers surprise you? -Michael
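The two ratios quoted here can be checked against the homepage counts listed earlier in the thread. This is only a sketch of that check: the figures are copied from the posts, and the variable names are invented for illustration.

```python
# Homepage counts quoted earlier in the thread (date, pages searched).
homepage_counts = [
    ("2001-10-09", 1_610_476_000),
    ("2002-08-03", 2_073_418_204),
    ("2003-02-02", 3_083_324_652),
    ("2003-10-06", 3_307_998_701),
    ("2004-10-19", 4_285_199_774),
    ("2004-12-07", 8_058_044_651),
    ("2005-08-06", 8_168_684_336),
]
WILDCARD_ESTIMATE = 21_440_000_000  # result count reported for the wildcard query

counts = dict(homepage_counts)

# Wildcard result count relative to the last published homepage figure.
ratio_vs_last = WILDCARD_ESTIMATE / counts["2005-08-06"]

# Growth from the first 2001 figure to the last 2004 figure.
ratio_2001_to_2004 = counts["2004-12-07"] / counts["2001-10-09"]

print(f"wildcard vs Aug 2005: {ratio_vs_last:.3f}x")
print(f"Oct 2001 to Dec 2004: {ratio_2001_to_2004:.3f}x")
```

Both quoted figures hold up as arithmetic; whether the wildcard count is a reliable index size is a separate question.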
I'm saying both. 21 billion is too high, and there is no query that will give you the number of pages in Google's index. Look at the index at the back of a book. Certain words are indexed many times. Count those --> you won't get the number of pages in the book, you will get far more.
You are right, but I think you are looking at it wrong, or slightly so... If I do a query on the word "the", I'm not looking at how many times the word "the" appears in the index, I'm counting (in theory) how many pages contain the word "the". If I do "the OR and", I should be looking at the number of pages that contain either of those words, not the number of pages that contain the first word plus the number of pages that contain the second word... You are right... doing what I did is not exactly like doing a count(*)... but it should be, according to design, almost exactly like doing a "WHERE {anything} LIKE '%'", don't you think? I know there are discrepancies, and I know this is because G uses an algo to estimate the number of results, not count them. This is why the site: operator was busted: they had the algo wrong, according to them. However, also according to them (or according to Matt, anyway) they have fixed that. Hey, Matt, you lurk here... feel free to jump in if you see this. Is this close, or would these diff results we get depending on how we query be one of those bugs you want us to report*? -Michael *Anyone know if he is still reading that list...?
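The difference between the two ways of counting can be shown on a toy inverted index. This is an illustrative sketch only: the four sample pages are invented, and it says nothing about how Google actually computes result counts, which per the discussion above are estimates.

```python
from collections import defaultdict

# Hypothetical four-page "web", invented for illustration.
pages = {
    1: "the cat and the dog",
    2: "the quick fox",
    3: "salt and pepper",
    4: "hello world",
}

# Inverted index: word -> set of page ids containing that word.
index = defaultdict(set)
for page_id, text in pages.items():
    for word in text.split():
        index[word].add(page_id)

pages_with_the = index["the"]
pages_with_and = index["and"]

# "the OR and" as a page count: the union, so page 1 is counted once.
union_count = len(pages_with_the | pages_with_and)

# Adding the two per-word counts double-counts page 1.
summed_count = len(pages_with_the) + len(pages_with_and)

# The book-index way of counting: every (word, page) entry in the index.
total_entries = sum(len(page_ids) for page_ids in index.values())

# Pages matching any indexed word at all: never more than the page count.
any_word_count = len(set().union(*index.values()))

print(union_count, summed_count, total_entries, any_word_count)
```

Counting index entries overshoots the number of pages, as in the book analogy, while an OR over page sets cannot exceed it. An exact OR query would therefore be bounded by the index size; an estimated count need not be.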