I just built a sitemap, and it's 1.8 megs. I've heard you should keep it under a certain file size, but I can't remember whether it's 100K, 150K, or what. I'm assuming the thing to do would be to split it into several files and cross-link those pages. What size do you recommend?
Google's official webmaster guidelines said no more than 100 links per page, last I checked... -- Derek
I checked, and it does say that, but I'm not sure it applies to sitemaps. With 3000 pages that puts me at a 31-page sitemap, which I think is a little too large. Does anyone else have any input?
Make more than one sitemap. Keep each one under 100k. Even though I've read that Google all but ignores anything after 100k, I've seen sites ranking very high on Google that are well above 100k.
With big sites I've seen many sitemaps divided by number or letter, like 1, 2, 3, 4 or A, B, C, D. Just a thought: if you have your site map in HTML, import it into Excel, sort all the URLs alphabetically, then copy the links for each letter back into HTML as single pages named A, B, C, D and so on. Maybe this will work for you.
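If you'd rather not round-trip through Excel, a few lines of script can do the same sort-and-split. This is only a minimal sketch (in Python) that assumes you already have your URLs and link text collected in a list; the example URLs and output file names are placeholders.

```python
# Minimal sketch: sort links alphabetically by anchor text and write one
# HTML sitemap page per starting letter (sitemap-A.html, sitemap-B.html, ...).
# `links` is assumed to be a list of (url, text) pairs you already have.
from collections import defaultdict

links = [
    ("https://example.com/apples", "Apples"),     # placeholder entries
    ("https://example.com/bananas", "Bananas"),
    # ... the rest of your URLs ...
]

pages = defaultdict(list)
for url, text in sorted(links, key=lambda pair: pair[1].lower()):
    first = text[:1].upper()
    key = first if first.isalpha() else "0-9"     # lump digits/symbols together
    pages[key].append((url, text))

for letter, entries in pages.items():
    items = "\n".join(f'<li><a href="{u}">{t}</a></li>' for u, t in entries)
    with open(f"sitemap-{letter}.html", "w", encoding="utf-8") as f:
        f.write("<html><body><h1>Site map: " + letter + "</h1>\n"
                '<p><a href="/">Home</a></p>\n<ul>\n' + items + "\n</ul></body></html>")
```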
The suggested limits are indeed 100 links and 101kb. The 101kb is text, of course, since spiders don't read images; beyond 101kb of text, I wouldn't count on the spiders reading anything. As for the site map, anything that huge obviously isn't being designed for humans, and although Google isn't clear on the issue, I wouldn't want to bet that Googlebot will follow 3000 links. My recommendation would definitely be to organize it into categories of no more than 100 links per category. Another advantage of this is that each page links back to the home page, and each page, if correctly categorized, has an opportunity to acquire PR from other pages.
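If the categories are already worked out, the only mechanical step left is capping each page at 100 links and adding the link back to the home page. Here's a rough sketch of that chunking, assuming a dict mapping a category name to a list of (url, text) pairs; the names and file layout are just illustrative.

```python
# Rough sketch: split each category's links into sitemap pages of at most
# 100 links, each page linking back to the home page.
PAGE_LIMIT = 100

def write_category_pages(categories):
    # `categories` maps a category name to a list of (url, text) pairs.
    for name, entries in categories.items():
        chunks = [entries[i:i + PAGE_LIMIT] for i in range(0, len(entries), PAGE_LIMIT)]
        for n, chunk in enumerate(chunks, start=1):
            items = "\n".join(f'<li><a href="{u}">{t}</a></li>' for u, t in chunk)
            with open(f"sitemap-{name}-{n}.html", "w", encoding="utf-8") as f:
                f.write(f"<html><body><h1>{name} (page {n} of {len(chunks)})</h1>\n"
                        '<p><a href="/">Home</a></p>\n'
                        f"<ul>\n{items}\n</ul></body></html>")
```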
I've seen site maps with over 3,000 links on them help get a site indexed within a couple of weeks. Even the links at the bottom of the site map got crawled. I've seen this many times.
Okay. I see a page with a lot of links and a slow load time. I also see that Google has cached the entire page, and that the page (text only) weighs in at a whopping 615 kb. The fact that it's cached does rather suggest that the conventional wisdom of Googlebot stopping at 101 kb is no longer true. That alone is a bit surprising. However, even if every page listed on this one has been indexed by Google, whether every page linked from this page was spidered FROM this page is another question, and one that can't be answered from this information alone. So the question of how many links Googlebot will follow from a single page remains open.
The 101 kb thing is a myth, I would say. It all depends on how much PR you point at the site map. If you point a few PR5 or PR6 links at it, the bot will crawl all the links even on a huge page like the one I posted above. I can show you an example of an Amazon site that got 25,000 pages indexed in two weeks from a site map created on another site, which was a high PR6. The site map on the other site drove the bot to deep-crawl the Amazon site within days. It was insane how fast it happened. The links on that site map were the only links pointing into those pages on the Amazon site, so that answers your second question. Since it's a spammy move for the PR6 site to do this, I won't post the link here in the forum. I can PM you the URL if you want to pursue this for research purposes.
It has been a myth for a long time, but like all search engine mythology, that doesn't stop the blind from leading the blind. Take this search in Google for example: "zouave zounds zulu zwischen zygote zymotic" (include the quotation marks). One hit, 355k cached, and Google is indexing the last six words. - Michael
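You can run the same check on your own pages: fetch the page, measure the text-only size, and take the last six words as a quoted phrase to search for. A quick sketch follows; the URL is a placeholder and the HTML-to-text step is deliberately crude, so treat the sizes as rough numbers.

```python
# Quick sketch: fetch a page, report its HTML and text-only size, and print
# the last six words as a quoted phrase you can paste into Google to see
# whether the tail end of the page got indexed.
import re
import urllib.request

url = "https://example.com/sitemap.html"   # placeholder URL
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# Crude tag stripping; good enough for a rough size/word check.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"\s+", " ", text).strip()

print(f"HTML size: {len(html) / 1024:.0f}k, text-only size: {len(text) / 1024:.0f}k")
print('Test query: "' + " ".join(text.split()[-6:]) + '"')
```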
Google's limit, as far as I can tell, is currently 513k; earlier it used to be 101k. Show me one page indexed beyond 513k and I am wrong. I will send you $20 via PayPal if I am. Offer open for 7 days from today. So if you can get a 1000k page indexed, go ahead: $20 waiting for you here.
The 600k+ page posted above shows 513k in Google, so I'd say you're probably right. 512k is 2^19 bytes, exactly the kind of power-of-two boundary Google's engineers would be likely to cut off at (the 513k is probably just rounding in the display)... Nice info, Honey. -- Derek
Search Google for "ninny pandir zozoter zozote- habis! (oh)". One hit, 520k cached, and the last six words indexed. - Michael
This thread is very interesting. I knew the 101k limit had been in doubt recently; now I can see it's no longer valid at all! Back to the thread's start: a sitemap should be for users too, so it shouldn't be too large. Instead, it should be organized alphabetically or by subject. A directory, in other words: a directory within the site, covering the site's own pages.
Ideally, I'd like a script that spiders the site, sorts pages into their appropriate categories, and grabs meta information to use as anchor text. It would lay everything out in a configurable way (number of links per page, number of pages), be search-engine friendly, and work from some sort of HTML template. I've found several paid scripts that almost meet my needs, but nothing that's been worth it so far. I have several spidering scripts I could modify, but it's not a priority for me at the moment, so for now it's manual.

I used sitemapper.pl to build this sitemap, but it ended up using a ton of memory on a site this large (it doesn't write anything to file until it's completely done). I'm tempted to link this extraordinarily large sitemap now, but I'm worried about the page size being a problem, not just for spiders but for end users as well. My pages are pretty well cross-linked, so I think I'll wait until I at least have something reasonable before making it live.

I appreciate the input. I think 512K is a nice target to shoot for with such a large site, but I'm going to go a little more conservative and aim for around 300K, with some kind of lighter option for my dial-up visitors. 300K isn't that many links once you add title text and an excerpt for each one.
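For what it's worth, the memory blow-up with sitemapper.pl goes away if the script streams each entry to disk as it crawls instead of holding the whole map in memory until the end. Here's a bare-bones sketch of that streaming idea in Python: a same-host crawler that uses each page's <title> as anchor text and appends to the output file as it goes. The start URL is a placeholder, and it skips everything a real sitemap script needs (robots.txt, crawl delays, categories, templates), so take it only as an illustration of the write-as-you-go approach.

```python
# Bare-bones sketch: crawl one host, use each page's <title> as anchor
# text, and append to the sitemap file as you go so memory stays flat.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

START = "https://example.com/"          # placeholder start URL
OUTPUT = "sitemap-links.html"

host = urlparse(START).netloc
seen = {START}
queue = deque([START])

with open(OUTPUT, "w", encoding="utf-8") as out:
    out.write("<ul>\n")
    while queue:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue                     # skip pages that fail to fetch
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        title = match.group(1).strip() if match else url
        out.write(f'<li><a href="{url}">{title}</a></li>\n')   # written immediately
        for href in re.findall(r'href=["\'](.*?)["\']', html, re.I):
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    out.write("</ul>\n")
```

The set of seen URLs still grows as it runs, but that's tiny compared with holding every page's markup in memory the way sitemapper.pl does.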