I'm working on a few sites right now and am doing Google Sitemaps for them. I'm noticing that the site crawlers I've turned loose on a couple of the sites (for the purposes of creating the XML file for Google) have choked a few times while going through the site. I'm wondering what can be done in general to improve site crawlability? I can't reveal the site names as they are for clients, so I guess I'm just looking for any general suggestions. What can we do to make it easier on those poor spiders? (By the way, Matt Cutts posted the following in his blog a few days ago, which is another reason crawlability has been on my mind: "Truthfully, much of the best SEO is common-sense: making sure that a site’s architecture is crawlable, coming up with useful content or services that has the words that people search for, and looking for smart marketing angles so that people find out about your site (without trying to take shortcuts).")
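For what it's worth, if the crawler output is usable, the XML file itself is simple to generate. Here's a minimal sketch (Python, with placeholder URLs) of writing a basic sitemap.xml in the standard sitemap protocol format that Google accepts:

```python
# Minimal sketch: write a basic sitemap.xml from a list of URLs.
# The URLs here are placeholders -- swap in the pages your crawler found.
from xml.sax.saxutils import escape

urls = [
    "http://www.example.com/",
    "http://www.example.com/articles/something-interesting.html",
]

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in urls:
        f.write("  <url>\n")
        f.write("    <loc>%s</loc>\n" % escape(url))
        f.write("  </url>\n")
    f.write("</urlset>\n")
```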
Some factors may make your site more difficult to crawl:
1. Session IDs
2. Number of directory levels
3. robots.txt
4. Sites that require login
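You can spot the first two of these mechanically. Here's a rough sketch (Python, example URL and parameter names only) that flags URLs carrying a session ID and pages sitting more than a few directory levels deep:

```python
# Quick check for two of the factors above: session IDs in the URL
# and how many directory levels deep a page sits.
from urllib.parse import urlparse, parse_qs

# Common session parameter names -- adjust for whatever your CMS uses.
SESSION_PARAMS = {"phpsessid", "sessionid", "sid", "jsessionid"}

def crawl_warnings(url):
    parsed = urlparse(url)
    warnings = []
    params = {k.lower() for k in parse_qs(parsed.query)}
    if params & SESSION_PARAMS:
        warnings.append("session ID in URL")
    depth = len([p for p in parsed.path.split("/") if p])
    if depth > 3:
        warnings.append("more than 3 directory levels deep (%d)" % depth)
    return warnings

print(crawl_warnings("http://www.example.com/shop/cat/sub/page.html?PHPSESSID=abc123"))
```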
Thanks WUA.
1. No session IDs while browsing the site.
2. Not quite sure what you mean here. Do you mean how deep the site goes?
3. Robots.txt. We have this to keep the spiders out of our admin and a few places in vBulletin. But that's a good thing, right?
4. We have a login system but don't require people to log in to access any parts of the site.
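Yes, blocking the admin area is fine, as long as robots.txt isn't accidentally blocking content too. A quick way to check is to run a few of your own paths through Python's robotparser and confirm the spiders are only shut out where you intend (paths below are just examples):

```python
# Sanity-check robots.txt: the admin area should be blocked,
# but ordinary content pages should still be fetchable by Googlebot.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

for path in ["/admin/", "/forum/admincp/", "/articles/some-article.html", "/"]:
    allowed = rp.can_fetch("Googlebot", "http://www.example.com" + path)
    print(path, "->", "allowed" if allowed else "blocked")
```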
Have you used Poodle Predictor and Xenu to see if they give any feedback? Xenu can show just what the URLs end up looking like. Sometimes I see things like mysite.com/info/../articles/something-interesting.html, and while it's OK in the browser it's so dodgy, and the site owners don't even realise what they're doing.
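If you already have a list of crawled URLs, you can catch those "../" leftovers without eyeballing them one by one. A small sketch (example URL only) that flags and normalises them:

```python
# Detect relative-path leftovers like /info/../articles/page.html.
# Browsers resolve these fine, but they create messy, duplicate-looking URLs.
from urllib.parse import urlparse, urlunparse
import posixpath

def normalize(url):
    parts = urlparse(url)
    clean_path = posixpath.normpath(parts.path)
    if parts.path.endswith("/") and clean_path != "/":
        clean_path += "/"
    return urlunparse(parts._replace(path=clean_path))

url = "http://mysite.com/info/../articles/something-interesting.html"
if ".." in urlparse(url).path:
    print("dodgy URL:", url, "->", normalize(url))
```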
What I have seen is that some sites require cookies because their CMS requires it. Set your Internet Explorer to not accept any cookies at all, then try to navigate the site again and see if you can get to all the pages.
Good tips. I did notice that when I turned off cookies, session IDs do get appended to the URLs (my guess is that has something to do with our e-commerce functionality). I was able to browse the site just fine, but I'm wondering if that could be causing any problems?
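It can, because spiders don't accept cookies either, so they may see a different session ID in every link and end up indexing duplicates. One way to check what a cookie-less crawler gets back is a fetch like this (the URL and parameter names are examples, adjust for your shop):

```python
# Fetch a page the way a cookie-less crawler would (urllib sends no cookies
# by default) and see whether the links it returns carry session IDs.
import re
import urllib.request

html = urllib.request.urlopen("http://www.example.com/").read().decode("utf-8", "replace")
links = re.findall(r'href="([^"]+)"', html)
suspect = [l for l in links if re.search(r'(PHPSESSID|sessionid|sid)=', l, re.I)]

print("%d of %d links contain a session ID" % (len(suspect), len(links)))
```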
This is kind of obvious, but if possible, keep all your pages no more than 3 levels deep (home page (1), category (2), category sub-page (3)). Making a site easy for people to navigate also makes it easy for spiders to crawl.
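If you want to measure that rather than guess, a rough breadth-first crawl from the home page will tell you how many clicks away each page really is. A sketch under obvious assumptions (placeholder start URL, no politeness delays, no robots.txt handling, capped crawl size):

```python
# Rough check of click depth from the home page: breadth-first crawl
# of internal links and report anything more than 3 clicks away.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

START = "http://www.example.com/"
host = urlparse(START).netloc

depths = {START: 0}
queue = deque([START])
while queue and len(depths) < 200:          # cap the crawl for the sketch
    url = queue.popleft()
    try:
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    except Exception:
        continue
    for href in re.findall(r'href="([^"#]+)"', html):
        link = urljoin(url, href)
        if urlparse(link).netloc == host and link not in depths:
            depths[link] = depths[url] + 1
            queue.append(link)

for link, depth in depths.items():
    if depth > 3:
        print(depth, link)
```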
Yahoo remain behind the 8 ball. No feedback on how often they'll revisit the list (but I like that they do standard RSS), no ability to ping, no account management so you can see if there were issues. Yahoo prove, again, why they are the poor relation to Google.

As for keeping pages within 3 levels, for smaller sites this should be totally achievable. But remember you can "deeplink" into your site in any number of creative ways, and natural linking is frequently a deeplink. This means the search engines are told that those pages which are more than 3 levels deep are also really important. FYI: WordPress lets old posts go much more than 3 levels deep; the navigation is sound, so it works despite that.
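The ping side is trivially scriptable with Google. A minimal sketch, assuming the ping URL Google documented for the Sitemaps program and a placeholder sitemap address (check the current docs before relying on it):

```python
# Ping Google when the sitemap changes. The endpoint below is the one
# documented for Google Sitemaps; the sitemap URL is a placeholder.
import urllib.parse
import urllib.request

sitemap_url = "http://www.example.com/sitemap.xml"
ping = ("http://www.google.com/webmasters/sitemaps/ping?sitemap="
        + urllib.parse.quote(sitemap_url, safe=""))
print(urllib.request.urlopen(ping).getcode())
```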