I thought I would take a moment to share the joy and excitement I am having fighting off dumb bots that cannot follow SEF URLs correctly. I'll first show how the bots FU the URLs, then make some suggestions on how to educate a misbehaving bot.

==Overview==

Surprisingly, the MSN bot is pretty well behaved, along with some offbeat ones such as DoCoMo and ichiro (who seem to be using the Google sitemap). As expected, Google is doing a perfect job as well (sitemapped). What is really weird is just how badly Slurp and its children are behaving!

== Example borks ==

Take a pretend cat site using SEF (Search Engine Friendly) URLs:

/cats/short-hair-cat/
/cats/long-hair-cat/
/cats/bob-tail-cat/

All cool, right? Wrong... Here are just some of Slurp's tricks I have seen in the last few weeks:

1. Drop the trailing slash: /cats/bob-tail-cat
2. Treat the dropped slash as an invitation to crawl every link of the bob-tail-cat "directory" that Slurp just created.
3. Start crawling nonexistent URLs like this: /cats/bob-tail-cat/cats/long-hair-cat/
4. Now drop the leading slash to get cats/long-hair-cat/
5. Crawl again to get: /cats/bob-tail-catcats/long-hair-cat
6. Repeat as many times as possible to get: /cats/bob-tail-catcats/short-hair-catcats/long-hair-catcats/bob-tail-cat/ (Slurp went 9 concatenated URLs deep on my site)
7. Take a trailing query string such as ?id=99 from another URL and append it: /cats/bob-tail-cat/cats/short-hair-catcats/long-hair-catcats/bob-tail-cat?id=99
8. Now try any nonexistent number in the query: /cats/bob-tail-cat/cats/short-hair-catcats/long-hair-catcats/bob-tail-cat?id=92

== The problems ==

As you would expect, on a static site you would just get a ton of 404 errors - but then who uses static anymore? For a dynamic site this is a nightmare, as it breaks any way to identify what the real request is. For example, using mod_rewrite or by parsing the URL, this example site might use "short-hair-cat" as part of a query to get content from the DB. If Slurp has turned it into "short-hair-catcats", then wave goodbye to your content.

Load: the Cafe/Kelsa variation of Slurp stopped by one of my sites. It played all of the tricks above and generated thousands of bogus URLs - then started crawling my site every 2 seconds. I couldn't stop it using robots.txt. I couldn't stop it using 404 errors, and 20,000 hits later I had to phone them to say stop crawling.

Extra special problems: you may be thinking, so what? Let Slurp crawl thousands of bogus URLs on your site - it'll be good for your rankings! I did think of that and was seriously considering just letting them go, until I realized that I had AdSense on the site, and if Google checked they would find a ton of indexed pages with no content. TOS violation - bye bye AdSense.

== Solutions ==

My solutions are all based on PHP - that's what I know and am fastest with - YMMV.

If your site is driven by delivering content based on the URL, you HAVE to validate. Check that you can actually get content from the DB using the variable/URL supplied; if not, redirect to an error page (good for AdSense). Make sure you send a 404 error header!

If the trailing slash has been dropped from the URL, as above, rewrite the URL to correct it and redirect the bot to the correct URL with a 301 header.

If the query string (?xxx=yyy) does not deliver content, redirect to an error page.
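To make that concrete, here is a minimal PHP sketch of those checks for the pretend cat site above. The cat_pages table, the error-page.php include, and the PDO connection details are all invented for the example - adapt them to however your own site actually stores and serves its content.

```php
<?php
// Sketch only: validate the SEF URL before trusting it, fix a dropped
// trailing slash with a 301, and send a real 404 for anything bogus.
// Table/column names and the DSN below are made up for illustration.

$pdo  = new PDO('mysql:host=localhost;dbname=cats', 'user', 'pass');
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); // e.g. /cats/bob-tail-cat/

// 1. Trailing slash dropped? 301 the bot back to the canonical URL.
if (preg_match('#^/cats/[a-z0-9-]+$#', $path)) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://' . $_SERVER['HTTP_HOST'] . $path . '/');
    exit;
}

// 2. Anything that is not exactly /cats/<slug>/ (concatenated junk and
//    other bot inventions) gets the error page with a proper 404 status.
if (!preg_match('#^/cats/([a-z0-9-]+)/$#', $path, $m)) {
    header('HTTP/1.1 404 Not Found');
    include 'error-page.php';
    exit;
}

// 3. Only serve the page if the slug really pulls content from the DB.
$stmt = $pdo->prepare('SELECT title, body FROM cat_pages WHERE slug = ?');
$stmt->execute([$m[1]]);
$page = $stmt->fetch(PDO::FETCH_ASSOC);

if ($page === false) {
    header('HTTP/1.1 404 Not Found');
    include 'error-page.php';
    exit;
}

// The same principle applies to query strings: if ?id=92 (or whatever
// parameter you use) doesn't pull a row from the DB, send the error
// page with a 404 header rather than rendering an empty template.

echo $page['body'];
```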
In every one of these cases, make sure you send a 404 error header.

Special note: sending the bot a ton of 404 errors is not generally a "good thing", so if you have the time and patience, try to write some code that figures out what the bot was actually requesting and use a 301 redirect to send it to the correct page (there is a rough sketch of that idea at the end of this post).

That's all I have at the moment, other than to say: learn to read your raw logs. Some of these problems may not be apparent from your log reader program.

If anyone has comments, fire away - of course any suggestions or solutions would be great. Hope that helps someone.
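Here is that rough sketch of the special note - guessing which real page a mangled, concatenated URL was aimed at and 301-redirecting the bot there instead of just 404ing it. Again, the cat_pages table and connection details are invented for the example, and you would only run something like this after your normal routing has already failed to match the request.

```php
<?php
// Sketch only: try to recover the intended page from a concatenated URL
// such as /cats/bob-tail-catcats/long-hair-cat/ and 301 the bot to it.
// Table name and DSN are made up for illustration.

$pdo  = new PDO('mysql:host=localhost;dbname=cats', 'user', 'pass');
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

// Pull the list of real slugs and see if the mangled path starts with one.
$slugs = $pdo->query('SELECT slug FROM cat_pages')->fetchAll(PDO::FETCH_COLUMN);

foreach ($slugs as $slug) {
    $canonical = '/cats/' . $slug . '/';
    // Starts with a real page, but isn't already the canonical URL itself.
    if (strpos($path, '/cats/' . $slug) === 0 && $path !== $canonical) {
        header('HTTP/1.1 301 Moved Permanently');
        header('Location: http://' . $_SERVER['HTTP_HOST'] . $canonical);
        exit;
    }
}

// Nothing recognisable in the path - fall back to a proper 404.
header('HTTP/1.1 404 Not Found');
include 'error-page.php';
```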
I think I *may* understand what you mean. I'll certainly remember this as I develop future web sites, or just upgrade old ones. Thank you =)
Good tip - thanks for letting everyone know. This is good for dynamic sites. Glad to see people still test things and post their findings on the forums.
I have had this problem recently with a directory site where I merged a bunch of separate directories together and used redirects on the old paths. Some of the bots did fine, but others just started making up their own index.