Should I deny Googlebot access to my forum threads? I'm thinking that if I don't... they'll spend more time indexing forum threads than the site pages themselves (I ran a search engine spider script on my site and it started indexing the forum first, and it took forever). The forum has about 300,000 threads.
No. Just make sure the forum links back to the site and they'll end up there too. Your main site is likely to have more inbound links (IBLs) anyway, hence more likely to get spidered more often. You can also try their Sitemaps feature.
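If you do try the Sitemaps feature, the file you submit is just an XML list of your URLs. A bare-bones example might look something like the sketch below (the URL, date and frequency values are placeholders -- see Google's Sitemaps documentation for the full format):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- loc is required; lastmod, changefreq and priority are optional hints -->
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2006-01-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>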
Don't remove the robots.txt file!! This will prevent bots from being able to spider your site at all; modify to suit your needs, but do not remove.
I would respectfully disagree. Not a single one of my sites has a robots file and they are all fully indexed. For example, one of my sites: http://www.google.ie/search?hl=en&q=site:www.yourfavouriteshop.com&meta= That site has never had a robots file. I routinely advise newbies not to use one, as they always seem to mess it up and cause problems with indexing. Make sure you understand the syntax before you put one up.
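To show the kind of thing that goes wrong: the difference between "index everything" and "index nothing" is a single slash. These are illustrations of the syntax only, not files to copy blindly:

# allows everything:
User-agent: *
Disallow:

# blocks the entire site from compliant bots:
User-agent: *
Disallow: /

That one-character slip is exactly the sort of mix-up that gets a newbie's site dropped from the index.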
Is it safe to assume that you also have a plethora of other sites linking to yours? I made the comment as the "new" MSN, and some others, seem to be "resistant" to crawling new sites, and it seems to help to have the robots.txt file. I too have seen many sites well indexed in all major engines without the robots file, but many of those have been well established for some time. For my own, and maybe others', education, what are some of the potential pitfalls of having the robots.txt file? Thanks!
I don't think there are any pitfalls to having or not having a robots.txt file; some of my sites have them and some don't, and I've never noticed a change when I've added or deleted one. I tend to add them to most of my sites nowadays, though, otherwise I end up with loads of 404 errors cluttering my logs from all the search engines requesting the file by default. The same goes for favicon.ico.
Removing the robots.txt file has no effect on the indexing of your pages. None of my sites have a robots.txt file, and they've all been spidered by Google and every other large SE. Josh
I guess we can put this one to bed easily, since a default file isn't hard to make. I have seen what I believe to be better attention from having one. Maybe it's just perception, but I have a few examples, which I won't share, that are my proof. Anywho, if ya want one, use this and everything will be allowed to be spidered:

User-agent: *
Disallow:

Save that as robots.txt and put it in your root folder. Note that there is a newline/return after Disallow:. Now, whichever way you prefer, you can do it right.
But no robots.txt file = disallow none, so why create something that sets a preference that's already set by default? Josh
Because whether or not you have one, Googlebot will check for it? The only downside is making a syntax error -- do what noppid says exactly in a text editor and you'll be fine. The original question, however, asked about a forum -- you can disallow certain files and folders in the forum to keep Googlebot on the actual posts pages and out of things like the members list, the admin and "process" files (posting, etc.). For example, a robots.txt file for a phpBB forum might look like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /media/
Disallow: /misc/
Disallow: /stats/
Disallow: /phpbb/admin/
Disallow: /phpbb/db/
Disallow: /phpbb/images/
Disallow: /phpbb/includes/
Disallow: /phpbb/language/
Disallow: /phpbb/profile.php
Disallow: /phpbb/groupcp.php
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/login.php
Disallow: /phpbb/modcp.php
Disallow: /phpbb/posting.php
Disallow: /phpbb/privmsg.php
Disallow: /phpbb/search.php
I agree about the syntax error issue, which has no effect on SEO but can be annoying for some people. It's still my opinion that that is the only reason you may need a robots file if you want your whole site indexed.
As I said earlier in the thread, all the reputable search engines will check for its existence anyway, so if you don't have one, you'll end up with loads of 404 errors clogging up your logs. Creating a "default" robots.txt (even just a blank file) stops this.
Me too. I don't like seeing 404 errors in AWStats; it makes me go look in the logs to see what link I broke. Besides, it makes me feel good to explicitly invite the robots in and not just leave the door open and imply that it's OK.
You can also have an empty file called robots.txt and it won't affect spidering, but depending on the size of your error doc, you could save a little bit of bandwidth.
I really wouldn't advise using an empty robots.txt file -- that may confuse some spiders -- either omit it altogether, which spiders know how to handle, or include the all-purpose "go ahead and spider everything" version posted earlier:

User-agent: *
Disallow:
Why would it confuse spiders any more than a lack of a robots.txt file? They'll download it, read in the rules and default to "allowed to crawl anything" if there's nothing there. I very much doubt that any of the major search engines would choke on an empty file, and any robot that does is probably poorly written anyway.
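For what it's worth, Python's standard library ships a robots.txt parser and it behaves exactly that way: an empty rule set falls through to "allow everything". A quick sketch (just an illustration of how a typical parser handles it, not a claim about how Google's own crawler is written; the URL is a placeholder):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([])  # feed it an empty robots.txt: no rules at all
# With no Disallow rules, any user-agent may fetch any URL
print(rp.can_fetch("Googlebot", "http://www.example.com/forum/viewtopic.php?t=12345"))  # prints True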
Have you looked at the code for all spiders? I'm not certain that it WOULD confuse them - I'd be concerned that it MIGHT, as I said. I also do not see the point of an empty robots.txt file. Why even have it? We do know that spiders can handle NO robots.txt file just fine, so there is zero risk in not having one other than the 404 log lines. If you are going to bother creating and uploading a robots.txt file at all (which I do recommend), why not put those two lines in it (see above)? Are you that short of server space?
From robotstxt.org: http://www.robotstxt.org/wc/exclusion-admin.html
From Google: http://www.google.com/bot.html#norobots
Depending on your server's 404 error page, you could save a *little* bandwidth by using an empty robots.txt file (the error page might be 1-2k, but an empty robots.txt file will be 0). Also, I have sites with empty robots.txt files that are indexed by G, Y & MSN and others, so I think that's evidence enough that there's not a problem (at least with the robots that matter). If you want to type in the two lines to allow access to everything, go right ahead, or take the lazy man's way out and upload a blank file. It does the same thing.