Today I'm probably finally going to put up a site that I've been working on for a while now, and I need to put a robots.txt file on it, but unfortunately I have no real experience with this. With all these spiders and crawlers out there that don't do anything productive for your site, I wanted to know how I would build the most ideal robots.txt and how I should go about it. I want to give access to the most well-known search engine crawlers and spiders, plus the less familiar ones that would still be a good decision to allow, and block the unwanted bandwidth-eating, no-good spiders and crawlers from my site. What would your approach be, guys? How can I handle this in the best way?
Well, this is what Google's looks like:

User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalog_list
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /wml?
Disallow: /xhtml?
Disallow: /imode?
Disallow: /jsky?
Disallow: /pda?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /froogle?
Disallow: /froogle_
Disallow: /print?
Disallow: /scholar?
Disallow: /complete
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Disallow: /maps?
Disallow: /translate?
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /reader/
Disallow: /chart?
Holy s*%t, that alone from Google itself? That would be a very long list in its totality. How do you guys manage that? I mean, surely you aren't allowing every crawler and spider?
That is the robots.txt that is on Google's own server, not what you should add (http://www.google.com/robots.txt). It is probably fine to just leave the file empty, or to allow every robot. I never had a reason to block a bot, and the really nasty bots will ignore robots.txt anyhow.
Yeah, I guess the nasty bots ignore the file anyway, as I have seen discussed in other threads before. I don't have any pages that I have a problem with getting indexed, so would I need a robots.txt file at all in this case? And if I want to allow every bot, how should I make the robots.txt file? Would it be a good idea to make a sticky about this subject, since this won't be the first or the last time it gets asked? It could cover how to set something like this up, what options there are for the various directives you can use in a robots.txt file, and maybe some do's and don'ts. Just an idea, because I have no clue at this point where to begin or how I need to set this up.
No, no. If you don't have a robots.txt file, you don't have any limitations controlling what robots can do when they crawl your site.
Well, without a robots.txt your error logs might fill up with all the requests for the nonexistent file, but that's the only disadvantage I can think of. A sticky seems like a good idea; I also don't know what the ideal robots.txt should look like.
Ohhh... I'd been wondering what that was. You can then just allow all robots access to the / folder, which will give them full access to everything.
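If you want to actually spell that out, the usual allow-everything robots.txt is just this (an empty Disallow line means nothing is off limits, so every bot can crawl the whole site):

User-agent: *
Disallow: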
Never thought this could also cause errors? A bit confused on that one. "Ideal" is probably asking a bit much, I guess, because what's ideal will vary from person to person, but something close to ideal, or at least what to look out for, would be very welcome information for everyone wondering about this subject. Come on guys, let's make a sticky about this.
There are some directories you don't want crawled because you don't want the spiders wasting time in there and you don't want the SEs indexing stuff people are never going to see or stuff that has no content. A forum is a good case in point: you want to restrict certain files and directories to concentrate the spiders on the ones that matter. Mine looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /media/
Disallow: /misc/
Disallow: /stats/
Disallow: /phpbb/admin/
Disallow: /phpbb/db/
Disallow: /phpbb/images/
Disallow: /phpbb/includes/
Disallow: /phpbb/language/
Disallow: /phpbb/profile.php
Disallow: /phpbb/groupcp.php
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/login.php
Disallow: /phpbb/modcp.php
Disallow: /phpbb/posting.php
Disallow: /phpbb/privmsg.php
Disallow: /phpbb/search.php
For this site I am using this:

User-agent: *
Disallow: /images/

Is this enough if I want all the bots to see my site and get every page of mine indexed?
Probably. What it actually means is that there is nothing in your robots.txt file that blocks spiders from anything except your images folder. It's always possible there is something else about your site or your hosting that may be a problem, but the robots.txt file itself is fine.
When I was a moderator at WPW, we created a sticky on the robots.txt file. People still kept asking the questions anyway - granted, we could refer them to the sticky, but then you still have to answer additional questions that aren't covered in the sticky, or that involve unique circumstances, or issues that won't be fixed by a robots.txt file.
You can block a few more bad bots with user agent blocking in httpd.conf. My list of bad robots is at "What are some bad web robots?"
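For anyone wondering what user agent blocking in httpd.conf actually looks like, here's a rough sketch using Apache's mod_setenvif together with the Order/Allow/Deny directives. The bot names and the directory path below are just placeholders for illustration, not my actual list - swap in whatever agents you want to refuse:

# Flag requests whose User-Agent matches a known bad bot (case-insensitive)
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "WebZIP" bad_bot
SetEnvIfNoCase User-Agent "larbin" bad_bot

# Refuse flagged requests for the whole document root
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>

The same SetEnvIfNoCase and Deny lines (minus the <Directory> wrapper) should also work from a .htaccess file if you can't edit httpd.conf yourself.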
Thanks Will, that's some good info over there; I put the site in my "personal" SEO toolbox, something I am working on lately.

To be clear on something... do I only have to remove the asterisk and replace it with the user agents in the bad bots list on your site?

# go away
User-agent: *
Disallow: /

Or do I have to keep this line in place and start lining the bad robots up underneath it? Also, can I make this file through WordPad? The httpd.conf method also sounds good, only I didn't grasp how to implement it by looking at the website.

Also, even though questions will still be asked regarding the robots.txt file, I think a lot of future questions can be answered by planting a sticky on this subject. Questions still being asked even though they are explained fully in a sticky is probably a recurring problem that can't be avoided, I think. Questions that keep being asked can still be valid, though, if the sticky doesn't fully cover the options for setting up a robots.txt file, and they will only improve the quality of such an information source. Also, since DP is getting more popular over time, queries made in search engines on this subject by beginning webmasters such as myself can increase the growth of the DP forums, because some of the results will refer to DP. Don't know if Shawn is down for DP expanding even more, but if he is, it's a good opportunity I guess. OK, enough of the sales pitch.
Bear in mind that the bad bots will usually ignore robots.txt anyway, so it's sort of a waste of time adding them to your robots.txt file. Use Notepad instead of WordPad, and make sure you're saving it as plain text (ANSI or ASCII).
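Just so the format is clear, though: if you did want to single out bots in robots.txt, you would keep your catch-all asterisk block and add a separate block per user agent rather than replacing the asterisk. Something like this, where the bot names are only placeholders, not a recommendation:

User-agent: EmailSiphon
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: *
Disallow: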
Will, there are a few odd ones in your list - I realize it came from funender, but he has given some dubious advice on other forums as well - so you may want to edit that list.

1. GetRight - this is an FTP download program (which I use myself for downloading updates, etc., on dial-up) - why would you block this?
2. FrontPage - why would this even appear unless you yourself are using FrontPage?
3. Xenu and Xenu Link Sleuth - this is a popular freeware link checker - again, I use it myself about once a month or so to scan for broken or redirected links on my site. If you ban this and someone who links to you uses Xenu, you stand a very good chance of losing backlinks.
4. Why is larbin there?
5. The lwp-trivial entries are not going to do anything - that's a string used by one of the forum-attacker worms, and it sure as hell isn't going to stop to read your robots.txt file.

There are a few others there that are benign and, as I said to Edz, most of the actual bad bots listed there are not going to obey robots.txt directives anyway. It's a waste of time using robots.txt files like this, IMO.