The most ideal robots.txt? How should it look?


Edz
Nov 24th 2005, 11:04 am
Today I'm probably finally going to put a site online that I've been working on for a while now. I need to put a robots.txt file in it and unfortunately have no experience with this :(

With all the spiders and crawlers out there that do nothing productive for your site, I wanted to know how I would build the most ideal robots.txt and how I should go about it.

I want to give access to the most well-known search engine crawlers and spiders, plus the less familiar ones that would still be a good decision to allow,
and block the unwanted, bandwidth-eating, no-good spiders and crawlers from my site.

What would your approach be, guys?
How can I handle this in the best way?

evilmonkeyspanker
Nov 24th 2005, 11:50 am
Well, this is what Google's looks like:

User-agent: *
Allow: /searchhistory/
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalog_list
Disallow: /news
Disallow: /nwshp
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /sorry/
Disallow: /imgres
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /cobrand
Disallow: /custom
Disallow: /advanced_group_search
Disallow: /advanced_search
Disallow: /googlesite
Disallow: /preferences
Disallow: /setprefs
Disallow: /swr
Disallow: /url
Disallow: /wml?
Disallow: /xhtml?
Disallow: /imode?
Disallow: /jsky?
Disallow: /pda?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /hws
Disallow: /bsd?
Disallow: /linux?
Disallow: /mac?
Disallow: /microsoft?
Disallow: /unclesam?
Disallow: /answers/search?q=
Disallow: /local?
Disallow: /local_url
Disallow: /froogle?
Disallow: /froogle_
Disallow: /print?
Disallow: /scholar?
Disallow: /complete
Disallow: /sponsoredlinks
Disallow: /videosearch?
Disallow: /videopreview?
Disallow: /videoprograminfo?
Disallow: /maps?
Disallow: /translate?
Disallow: /ie?
Disallow: /sms/demo?
Disallow: /katrina?
Disallow: /blogsearch?
Disallow: /reader/
Disallow: /chart?

Edz
Nov 24th 2005, 1:34 pm
Holy s*%t, that alone from Google itself?

That would be a very long list in its totality :eek:

How do you guys manage that? I mean surely you aren't allowing every crawler and spider?

Dejavu
Nov 24th 2005, 1:38 pm
That is the robots.txt that is on Google's server (http://www.google.com/robots.txt), not what you should add.
It is probably ok to just leave the file empty, or to allow every robot. I never had a reason to block a bot, and the really nasty bots will ignore robots.txt anyhow.
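
The two options mentioned here (an empty file, or a file that allows every robot) can be checked with a minimal sketch using Python's standard-library urllib.robotparser. The helper name allowed() and the example.com URL are made up for illustration, and real crawlers may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

EMPTY = ""                                  # an empty robots.txt file
ALLOW_ALL = "User-agent: *\nDisallow:\n"    # explicit "nothing is disallowed"

# Hypothetical URL, used only to have something to ask about.
URL = "http://www.example.com/any/page.html"

for name, text in (("empty", EMPTY), ("allow-all", ALLOW_ALL)):
    print(name, allowed(text, "Googlebot", URL))
```

Under this parser both variants give the same answer for any well-behaved bot, so the choice between them is mostly about keeping 404s out of the logs and making the intent obvious.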

Edz
Nov 24th 2005, 3:22 pm
Yeah, I guess the nasty bots ignore the file anyway, as I have seen discussed in other threads before.

I don't have any pages that I have a problem with getting indexed, so would I need a robots.txt file in this case?

And if I want to allow every bot, how should I make the robots.txt file?

Would it be a good idea to make a sticky about this subject, since this won't be the first or the last time this gets asked?

With instructions on how to set something like this up, and what options there are for the various directives you can use in a robots.txt file.

And maybe dos and don'ts?

Just an idea, because at this point I have no clue where to begin or how I need to set this up.

Dekker
Nov 24th 2005, 3:23 pm
Quote (Edz): "I don't have any pages that I have a problem with getting indexed, so would I need a robots.txt file in this case?"

No.

Quote (Edz): "And if I want to allow every bot, how should I make the robots.txt file?"

You don't need to make one.

If you don't have a robots.txt file, there are no limitations on what robots can do when they crawl your site.

Dejavu
Nov 24th 2005, 3:51 pm
Well, without a robots.txt your error logs might fill up with all the requests for the nonexistent file, but that's the only disadvantage I can think of.
A sticky seems like a good idea; I also don't know what the ideal robots.txt should look like.

Dekker
Nov 24th 2005, 3:53 pm
Quote (Dejavu): "Well, without a robots.txt your error logs might fill up with all the requests for the nonexistent file, but that's the only disadvantage I can think of.
A sticky seems like a good idea; I also don't know what the ideal robots.txt should look like."

Ohhh... I've been wondering what that is.

You can then just allow all robots into the / folder, which will give full access to everything.

Edz
Nov 24th 2005, 3:56 pm
Quote (Dejavu): "Well, without a robots.txt your error logs might fill up with all the requests for the nonexistent file, but that's the only disadvantage I can think of."

I never thought this could also cause errors. A bit confused on this one :confused:

Quote (Dejavu): "A sticky seems like a good idea; I also don't know what the ideal robots.txt should look like."

"Ideal" is asking a bit too much, I guess ;) since what is ideal would vary from person to person. But something close to ideal, or pointers on what to look out for, would be very welcome information for everyone wondering about this subject ;)

Come on guys, let's make a sticky about this :)

minstrel
Nov 24th 2005, 6:20 pm
There are some directories you don't want crawled because you don't want the spiders wasting time in there and you don't want the SEs indexing stuff people are never going to see or stuff that has no content.

A forum is a good case in point: You want to restrict certain files and directories to concentrate the spiders on the ones that matter. Mine looks like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /media/
Disallow: /misc/
Disallow: /stats/
Disallow: /phpbb/admin/
Disallow: /phpbb/db/
Disallow: /phpbb/images/
Disallow: /phpbb/includes/
Disallow: /phpbb/language/
Disallow: /phpbb/profile.php
Disallow: /phpbb/groupcp.php
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/login.php
Disallow: /phpbb/modcp.php
Disallow: /phpbb/posting.php
Disallow: /phpbb/privmsg.php
Disallow: /phpbb/search.php
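
To see what a file like this does in practice, here is a small sketch that runs a shortened copy of the rules above through Python's standard-library urllib.robotparser. The function name blocked_paths() and the test paths are assumptions made for illustration. In particular, viewtopic.php (phpBB's topic page) is not in the list above and is included only to show that content pages stay crawlable.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the forum rules quoted above.
FORUM_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /phpbb/admin/
Disallow: /phpbb/posting.php
Disallow: /phpbb/search.php
"""

def blocked_paths(robots_txt, agent, paths):
    """Return the subset of `paths` that `agent` is told not to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]

# Hypothetical request paths, chosen only for this example.
paths = [
    "/phpbb/viewtopic.php?t=1",        # topic page, not listed in the rules
    "/phpbb/posting.php?mode=reply",   # reply form, listed
    "/phpbb/admin/index.php",          # admin area, listed
    "/index.html",                     # site home page, not listed
]
print(blocked_paths(FORUM_RULES, "Googlebot", paths))
```

The rules are prefix matches, so a Disallow on /phpbb/posting.php also covers the same script with a query string attached.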

amitpatel_3001
Nov 24th 2005, 10:45 pm
For this site (http://www.earn123.info)

I am using this:

User-agent: *
Disallow: /images/

Is this enough if I want all the bots to see my site and index every page?

minstrel
Nov 24th 2005, 11:24 pm
That will allow spidering of everything except what is in your /images folder.

amitpatel_3001
Nov 24th 2005, 11:39 pm
So this means all the search engines will index my pages regularly?
Which is what I need.

minstrel
Nov 25th 2005, 12:04 am
Probably.

What it actually means is that there is nothing in your robots.txt file that blocks spiders from anything except your images folder.

It's always possible there is something else about your site or your hosting that may be a problem but the robots.txt file is fine.
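
As a rough check of that reading, the two-line file can be fed to Python's standard-library urllib.robotparser. This is only a sketch: may_crawl() and the paths are invented for the example, and it says nothing about whether a search engine will actually choose to index a page.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the question above.
RULES = ["User-agent: *", "Disallow: /images/"]

parser = RobotFileParser()
parser.parse(RULES)

def may_crawl(path, agent="Googlebot"):
    """True if the rules above let `agent` request `path`."""
    return parser.can_fetch(agent, path)

# Hypothetical paths on the site.
for path in ("/", "/about.html", "/images/logo.gif"):
    print(path, may_crawl(path))
```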

Edz
Nov 26th 2005, 1:32 am
So, nobody would like to see a sticky about this, I presume?

minstrel
Nov 26th 2005, 6:42 am
When I was a moderator at WPW, we created a sticky on the robots.txt file. People still kept asking the questions - granted, we could refer them to the sticky but then you still have to answer additional questions that aren't clear in the sticky or about unique circumstances or about issues that won't be fixed by a robots.txt file.

Will.Spencer
Nov 26th 2005, 7:09 am
Quote (Edz): "Yeah, I guess the nasty bots ignore the file anyway, as I have seen discussed in other threads before."

You can block a few more bad bots with user agent blocking in httpd.conf (http://www.internet-search-engines-faq.com/prevent-web-site-download.shtml).

My list of bad robots is at What are some bad web robots? (http://www.internet-search-engines-faq.com/bad-robots.shtml)

Edz
Nov 26th 2005, 8:28 am
Thanks Will, that's some good info over there. I put the site in my "personal" SEO toolbox, something I have been working on lately ;)

To be clear on something...

I only have to remove the asterisk, right? And replace it with the user agents in the *bad bots* list on your site?

# go away
User-agent: *
Disallow: /

Or do I have to keep the line User-agent: * in place and start lining the bad robots up underneath it?

Also, can I make this file through WordPad?

Quote (Will.Spencer): "You can block a few more bad bots with user agent blocking in httpd.conf."

This method also sounds good, only I didn't grasp how to implement it from looking at the website.

Also, even though questions will still be asked about the robots.txt file, I think a lot of future questions can be answered by posting a sticky on this subject.
And questions still being asked even though they are fully explained in a sticky is probably a recurring problem that can't be avoided, I think.

Questions that still get asked can be valid, though, if the sticky doesn't fully cover the options for setting up a robots.txt file, and answering them will only improve the quality of such an information source.

Also, since DP is getting more popular over time, search engine queries on this subject by beginning webmasters such as myself can increase the growth of the DP forums, because some of the results will refer to DP :)

Don't know if Shawn is down for DP expanding even more, but if he is, it's a good opportunity I guess.

OK, enough of the sales pitch :D

minstrel
Nov 26th 2005, 8:43 am
Quote (Edz): "To be clear on something...

I only have to remove the asterisk, right? And replace it with the user agents in the *bad bots* list on your site?

# go away
User-agent: *
Disallow: /

Or do I have to keep the line User-agent: * in place and start lining the bad robots up underneath it?"

Bear in mind that the bad bots will usually ignore robots.txt anyway so it's sort of a waste of time adding them to your robots.txt file.

Quote (Edz): "Also, can I make this file through WordPad?"

Use Notepad instead and make sure you're saving it as plain text (ANSI or ASCII).

minstrel
Nov 26th 2005, 8:54 am
Will, there are a few odd ones in your list - I realize it came from funender but he has given some dubious advice on other forums as well - you may want to edit that list.

1. GetRight - this is an FTP download program (which I use myself for downloading updates, etc., on dial-up) - why would you block this?
2. FrontPage - why would this even appear unless you are using FrontPage yourself? :confused:
3. Xenu and Xenu Link Sleuth - this is a popular freeware link checker - again, I use it myself about once a month or so to scan for broken or redirected links on my site - if you ban this and someone who links to you uses Xenu, you stand a very good chance of losing backlinks.
4. why is larbin there?
5. the lwp-trivial entries are not going to do anything - that's a string used by one of the forum-attacker worms and it sure as hell isn't going to stop to read your robots.txt file

There are a few others there that are benign and, as I said to Edz, most of the actual bad bots listed there are not going to obey robots.txt file directives anyway. It's a waste of time using robots.txt files like this, IMO.

Will.Spencer
Nov 26th 2005, 9:52 am
Quote (minstrel): "Will, there are a few odd ones in your list - I realize it came from funender but he has given some dubious advice on other forums as well - you may want to edit that list."

I've been doing that.

I'll take a look at the ones you mentioned.

Heck, I have to enable Xenu every time I use it on myself.

Edz
Nov 26th 2005, 11:52 am
Thanks for that info Minstrel:cool:

Edz
Dec 9th 2005, 9:47 am
OK, I would like to get some more knowledge about this robots.txt, so I would like to ask a couple of questions.

If I want the regular bots such as Google and Yahoo to visit and index my site, and want the bad bots denied access, I would have to use this type of syntax, right?

I know the really bad bots will ignore robots.txt, but everything I can manage to block with just a simple text file is well worth it. What could it hurt, right? (Or maybe I am missing something here.)

Ok here goes:

User-agent: *
Disallow:

Disallow: /
User-agent: Alexibot

Disallow: /
User-agent: Alexibot

Disallow: /
User-agent: Alexibot

I used Alexibot as a sample for the example.
But would this mean that all robots are granted access except for the ones listed with Disallow: / (here, Alexibot)?

I know there are a lot of bots that will ignore the robots.txt file, but every bot on the list that can be stopped is one that bites the dust and saves me bandwidth :)

I will also look into the botsense beta service, which, yeah, I know is also not foolproof, but everything helps, and hopefully it will improve even more over time, as they say ;)

minstrel
Dec 9th 2005, 10:11 am
No.

First, you have the order of lines reversed - the User-agent line comes before its Disallow line:

User-agent: Alexibot
Disallow: /

User-agent: *
Disallow:

Second, repeating the name of the user-agent won't help - once is enough.

Third, construct your robots.txt file for the GOOD bots. Filling it full of stuff for the bad bots won't help you because the bad bots are not even going to read the robots.txt file.
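
The first point can be illustrated with a short sketch that parses both versions with Python's standard-library urllib.robotparser. The names REVERSED, CORRECTED and can_crawl() are made up for the example, and the result only describes how this one parser reads the files. Other crawlers may handle a malformed file differently.

```python
from urllib.robotparser import RobotFileParser

# Disallow placed before its User-agent line, as in the question.
REVERSED = """\
User-agent: *
Disallow:

Disallow: /
User-agent: Alexibot
"""

# User-agent first, then the rule that applies to it.
CORRECTED = """\
User-agent: Alexibot
Disallow: /

User-agent: *
Disallow:
"""

def can_crawl(robots_txt, agent, path="/"):
    """True if `agent` may fetch `path` according to this parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

for label, text in (("reversed", REVERSED), ("corrected", CORRECTED)):
    print(label, "Alexibot allowed:", can_crawl(text, "Alexibot"))
```

In the reversed file the Disallow: / line has no User-agent line above it, so this parser drops it and Alexibot falls through to the catch-all group.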

Edz
Dec 9th 2005, 11:06 am
Oh man, you're right. Yeah, I have to list it as you said and not the other way around. Why did I even put it like that anyway :confused: Copy and paste error.

Yeah, I know repeating is not necessary, but I did this as an example to illustrate various bot names :)
Putting a list in wouldn't hurt. Whether it would help to deter all the bots is another question, but if you don't shoot, you miss for certain ;)

But in this manner:


User-agent: *
Disallow:

User-agent: Alexibot
Disallow: /

I would allow all bots and would make an attempt to block Alexibot, right?

minstrel
Dec 9th 2005, 11:10 am
I think the usual order is disallow bots first and allow all others at the end, though it may not matter at all.

I really don't disallow bots - only certain directories I don't want indexed. As I said, the good bots I want to let in - the bad bots are going to ignore the robots.txt file anyway so all that does is clutter up the file for the good bots. Why make it any harder than it has to be for Googlebot and the other Good Witches?

ServerUnion
Dec 9th 2005, 11:37 am
Quote (minstrel): "I think the usual order is disallow bots first and allow all others at the end, though it may not matter at all."

Unfortunately, "ALLOW" isn't an accepted function. Unless you are worried about dupe content from print pages or something, just place an empty robots.txt in your root to limit the 404 errors in your logs.

I wrote an article a while back that may help in your research: http://www.directory-submission.net/do-i-need-a-robots.txt-file.htm

Will.Spencer
Dec 9th 2005, 5:21 pm
Quote (ServerUnion): "Unless you are worried about dupe content from print pages or something, just place an empty robots.txt in your root to limit the 404 errors in your logs."

I keep a nice big fat robots.txt to help reduce bandwidth utilization from annoying and useless web robots.

minstrel
Dec 9th 2005, 7:40 pm
Quote (minstrel): "I think the usual order is disallow bots first and allow all others at the end, though it may not matter at all."

Quote (ServerUnion): "Unfortunately, "ALLOW" isn't an accepted function. Unless you are worried about dupe content from print pages or something, just place an empty robots.txt in your root to limit the 404 errors in your logs."

I was using the term loosely.

To rephrase what I said above, BLOCK the robots you want to block first and then SPECIFY ACCESS to the ones you don't want to block at the end.

That is,

User-agent: Alexibot
Disallow: /

User-agent: *
Disallow:

NOT

User-agent: *
Disallow:

User-agent: Alexibot
Disallow: /

but as I also said above I'm not sure the order really matters.
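
Whether the order matters can at least be tested against one parser. The sketch below feeds both orderings to Python's standard-library urllib.robotparser and compares the answers. BLOCK_FIRST, CATCH_ALL_FIRST and verdicts() are names invented for this example, and the result describes this parser only, not every crawler in the wild.

```python
from urllib.robotparser import RobotFileParser

# Specific bot first, catch-all group last.
BLOCK_FIRST = [
    "User-agent: Alexibot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow:",
]

# Catch-all group first, specific bot last.
CATCH_ALL_FIRST = [
    "User-agent: *",
    "Disallow:",
    "",
    "User-agent: Alexibot",
    "Disallow: /",
]

def verdicts(lines, agents=("Alexibot", "Googlebot")):
    """Map each agent to True/False: may it fetch the site root?"""
    parser = RobotFileParser()
    parser.parse(lines)
    return {agent: parser.can_fetch(agent, "/") for agent in agents}

print(verdicts(BLOCK_FIRST) == verdicts(CATCH_ALL_FIRST))
```

This parser looks for a group naming the bot before falling back to the * group, so both orderings block Alexibot and let everything else through.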

Edz
Dec 10th 2005, 5:10 am
To be clear on something guys, when I use this syntax:

User-agent: Alexibot
Disallow: /

User-agent: *
Disallow:

Would I also need to specify each bot that I want to grant access? I am not sure about this. Or would the above suffice?

Will Spencer, I see no indication of this in your robots.txt, for instance, only the ones that are disallowed. So can I presume I only have to put in the file which ones to block, and the ones that aren't listed, such as Googlebot, would crawl the site since they don't encounter any reference to themselves?

minstrel
Dec 10th 2005, 8:10 am
Quote (Edz): "Would I also need to specify each bot that I want to grant access? I am not sure about this. Or would the above suffice?"

No. This part:

User-agent: *
Disallow:

applies to ALL bots not specifically mentioned.

Quote (Edz): "Can I presume I only have to put in the file which ones to block, and the ones that aren't listed, such as Googlebot, would crawl the site since they don't encounter any reference to themselves?"

Yes. Although it's best to include the two lines above to state that clearly. It may not be necessary for the bots but if nothing else it will remind you, the human, of what the robots.txt file is doing.

Edz
Dec 10th 2005, 8:58 am
Thank you minstrel for clearing that up.

Much appreciated:cool:

minstrel
Dec 10th 2005, 9:06 am
:) Happy to help, Edz.

mcfox
Dec 10th 2005, 9:31 am
Not thread related but Will, your site isn't displaying properly in Opera.

minstrel
Dec 10th 2005, 9:40 am
A while back, I installed Opera specifically because it seems to be the one that is most likely these days to break pages. I don't use it for general browsing but it can be instructive to get a copy to view your site... much like in the old days I used to keep a copy of Netscape 4.7x around as a worst case scenario browser.