About a year ago I remember reading one person's theory that Google is going to run into a problem when it reaches 2^32 pages in its main web index. That is 4,294,967,296 web pages, and Google currently states on its homepage that it has 4,285,199,774 web pages. I've been a big Google fan since about 2000, but I've noticed it seems to be losing its edge on certain searches (I'm pretty picky and sometimes looking for very specific things). It's still great, don't get me wrong, but I've noticed a lot of so-called "dancing" lately: sites disappearing from and reappearing in the SERPs. Completely new pages being added and then disappearing is what concerns me, while Yahoo/Overture have added the same new pages lightning fast and they're staying there. So has anyone got any ideas or theories about the 32-bit limit? I'd assume Google, being a bunch of smart cookies, should easily be able to overcome a theoretical problem like that. Another theory is that Google is trying to keep "spammy" results from creeping too far into its index. What are your thoughts?
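Just to put numbers on the theory, here's a quick back-of-the-envelope sketch (nothing Google-specific, just unsigned 32-bit arithmetic and the two figures quoted above) showing how close the homepage count sits to the ceiling:

```python
# Back-of-the-envelope check of the 32-bit theory.
# The numbers are the ones quoted above; nothing here is Google internals.
max_32bit = 2 ** 32                 # 4,294,967,296 -- most IDs a 32-bit doc ID can address
homepage_count = 4_285_199_774      # figure shown on Google's homepage

headroom = max_32bit - homepage_count
print(f"32-bit ceiling : {max_32bit:,}")
print(f"Homepage count : {homepage_count:,}")
print(f"Headroom left  : {headroom:,} pages")  # roughly 9.8 million
```

That's only about 9.8 million pages of headroom, which is why the theory sounds plausible at first glance.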
I'd be very surprised if Google hadn't addressed the 32-bit problem a long time ago. They could have moved their 'primary key' field to 64-bit, or a GUID, or maybe they don't even need a 'primary key' in their database at all -- the URL of the page itself is, by definition, unique and therefore could serve as a 'primary key'... -- Derek
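Purely to illustrate the three options Derek lists (not how Google actually stores anything -- the function names and key formats here are made up), a rough sketch of what each scheme could look like:

```python
# Illustrative only: three ways a crawler's document store could key pages,
# mirroring the options above. Names and structures are hypothetical.
import hashlib
import uuid
from itertools import count

# Option 1: a plain 64-bit auto-incrementing ID (2**64 ~ 1.8e19 documents).
_next_id = count(1)
def next_doc_id_64() -> int:
    return next(_next_id)  # would be a BIGINT / int64 column in a database

# Option 2: a GUID/UUID -- 128 bits, no central counter needed.
def doc_guid() -> str:
    return str(uuid.uuid4())

# Option 3: use the URL itself (or a fixed-width hash of it) as the key.
def url_key(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

print(next_doc_id_64(), doc_guid(), url_key("http://www.google.com/"))
```

The URL-as-key option trades a fixed-width integer for variable-length strings, which is presumably why a fixed-width hash of the URL would be substituted in practice.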
I also think Google is quite capable of handling such a problem (if it even exists), although that doesn't necessarily answer the question of why they haven't broken through any impressive numbers with their index. I remember reading an article at the start of the year in which a Google spokesperson said they hoped to have 10 billion pages indexed by the end of the year. Then again, I suppose Google would pick quality over sheer quantity.
It appears to me as if they HAVE sorted it: http://www.google.com/search?q=the returns 5,800,000,000 results for me, which is 1,505,032,704 more than 4,294,967,296.
And why, I wonder, is 'how' a filtered word but not 'the'? Conspiracy theory here: Maybe the "the" search is smoke and mirrors on Google's part to make it LOOK like they handle more than 4 billion web pages... Maybe someone should go through and count each page to make sure. Volunteers? -- Derek
Yep, if you pay me 5 cents per page. Of course, I'm sure you're aware the figures quoted in the SERPs are only estimates, not exact figures.
I'm sure I saw someone post somewhere that the "©2004 Google - Searching 4,285,199,774 web pages" line has been the same for at least the last year (apart from the 2004 bit), which is odd considering they are indexing new pages all the time.
Try this link for a discussion of the theoretical limit of the Google index and how Google could address it: Google Index ID
Thanks for the links, everyone. The "Is Google Broken" link looks very familiar to me, but the date would indicate otherwise.
I think the example with the ID for cached documents is good enough. It seems like there's enough ID space for 2^72 documents, which is huge... 4,722,366,482,869,645,213,696 is how many dollars you wish you had.
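For anyone who wants to sanity-check that figure, a 72-bit ID really does give that many values:

```python
# Sanity check on the 2^72 figure quoted above.
id_bits = 72
id_space = 2 ** id_bits
print(f"{id_space:,}")           # 4,722,366,482,869,645,213,696
print(id_space > 5_800_000_000)  # True -- comfortably larger than the 5.8 billion result count
```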
My experience is that both of these SEs are notoriously slow at indexing new pages and new websites.
I have a small site of around 200 pages. I've been noticing that some of my pages are dropping out of the index. Also, some of my pages only list the URL when I do a site:mysite.com search; no description or title tags show in the Google index. Here is a good article on that topic: http:***//www.w3reports.com/index.php?itemid=549 (remove the *** from the URL).
I don't see anyone mentioning that Google now has not one but two indexes since the addition of its supplemental index. IMO that solves the 32-bit address problem rather easily, if in fact it ever existed.
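If that is the mechanism (pure speculation on my part, since Google hasn't said how the supplemental index is keyed), the arithmetic does work out: two separate 32-bit ID spaces behave like one 33-bit space, which is enough room for the 5.8 billion figure quoted earlier.

```python
# Speculative arithmetic only: if the main and supplemental indexes each kept
# their own 32-bit doc IDs, the combined capacity is effectively one extra bit.
per_index = 2 ** 32
combined = 2 * per_index          # same as 2 ** 33
print(f"Per index : {per_index:,}")  # 4,294,967,296
print(f"Combined  : {combined:,}")   # 8,589,934,592 -- room for the 5.8 billion figure
```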