
The ideal robots.txt? What should it look like?

Discussion in 'robots.txt' started by Edz, Nov 24, 2005.

  1. #1
    Today I'm probably, finally, going to put up a site I've been working on for a while now, and I need to put a robots.txt file in it, but unfortunately I have no experience with this :(

    With all these spiders and crawlers out there that don't do anything productive for your site, I wanted to know how I would build the ideal robots.txt and how I should go about it.

    I want to give access to the most well-known search engine crawlers and spiders, plus the less familiar ones that would be a good decision to allow.
    And block the unwanted, bandwidth-eating, no-good spiders and crawlers from my site.

    What would your approach be, guys?
    How can I handle this in the best way?
     
    Edz, Nov 24, 2005 IP
  2. evilmonkeyspanker

    evilmonkeyspanker Peon

    Messages:
    276
    Likes Received:
    15
    Best Answers:
    0
    Trophy Points:
    0
    #2
    Well, this is what Google's looks like:

    User-agent: *
    Allow: /searchhistory/
    Disallow: /search
    Disallow: /groups
    Disallow: /images
    Disallow: /catalogs
    Disallow: /catalog_list
    Disallow: /news
    Disallow: /nwshp
    Disallow: /?
    Disallow: /addurl/image?
    Disallow: /pagead/
    Disallow: /relpage/
    Disallow: /sorry/
    Disallow: /imgres
    Disallow: /keyword/
    Disallow: /u/
    Disallow: /univ/
    Disallow: /cobrand
    Disallow: /custom
    Disallow: /advanced_group_search
    Disallow: /advanced_search
    Disallow: /googlesite
    Disallow: /preferences
    Disallow: /setprefs
    Disallow: /swr
    Disallow: /url
    Disallow: /wml?
    Disallow: /xhtml?
    Disallow: /imode?
    Disallow: /jsky?
    Disallow: /pda?
    Disallow: /sprint_xhtml
    Disallow: /sprint_wml
    Disallow: /pqa
    Disallow: /palm
    Disallow: /gwt/
    Disallow: /purchases
    Disallow: /hws
    Disallow: /bsd?
    Disallow: /linux?
    Disallow: /mac?
    Disallow: /microsoft?
    Disallow: /unclesam?
    Disallow: /answers/search?q=
    Disallow: /local?
    Disallow: /local_url
    Disallow: /froogle?
    Disallow: /froogle_
    Disallow: /print?
    Disallow: /scholar?
    Disallow: /complete
    Disallow: /sponsoredlinks
    Disallow: /videosearch?
    Disallow: /videopreview?
    Disallow: /videoprograminfo?
    Disallow: /maps?
    Disallow: /translate?
    Disallow: /ie?
    Disallow: /sms/demo?
    Disallow: /katrina?
    Disallow: /blogsearch?
    Disallow: /reader/
    Disallow: /chart?
     
    evilmonkeyspanker, Nov 24, 2005 IP
  3. Edz

    Edz Peon

    Messages:
    1,690
    Likes Received:
    72
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Holy s*%t, that alone from Google itself?

    That would be a very long list in its totality:eek:

    How do you guys manage that? I mean, surely you aren't allowing every crawler and spider?
     
    Edz, Nov 24, 2005 IP
  4. Dejavu

    Dejavu Peon

    Messages:
    916
    Likes Received:
    53
    Best Answers:
    0
    Trophy Points:
    0
    #4
    That is the robots.txt that is on Google's own server, not what you should add..
    (http://www.google.com/robots.txt)
    It is probably OK to just leave the file empty, or to allow every robot. I've never had a reason to block a bot, and the really nasty bots will ignore robots.txt anyhow.
     
    Dejavu, Nov 24, 2005 IP
  5. Edz

    Edz Peon

    Messages:
    1,690
    Likes Received:
    72
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Yeah, I guess the nasty bots ignore the file anyway, as I've seen discussed in other threads before.

    I don't have any pages that I have a problem with getting indexed, so would I need a robots.txt file in this case?

    And if I want to allow every bot, how should I make the robots.txt file?

    Would it be a good idea to make a sticky about this subject, since this won't be the last time it gets asked?

    With instructions on how to set something like this up, and what kind of options there are for the various directives possible in a robots.txt file.

    And maybe do's and don'ts?

    Just an idea, because I have no clue at this point where to begin or how to set this up.
     
    Edz, Nov 24, 2005 IP
  6. Dekker

    Dekker Peon

    Messages:
    4,185
    Likes Received:
    287
    Best Answers:
    0
    Trophy Points:
    0
    #6
    no

    no.

    if you don't have a robots file, you don't have any limitations controlling what robots can do when they crawl your site.
     
    Dekker, Nov 24, 2005 IP
  7. Dejavu

    Dejavu Peon

    Messages:
    916
    Likes Received:
    53
    Best Answers:
    0
    Trophy Points:
    0
    #7
    Well, without a robots.txt your error logs might get filled up with all the requests for a nonexistent file, but that's the only disadvantage I can think of..
    A sticky seems like a good idea; I also don't know what the ideal robots.txt should look like.
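    To illustrate what fills up the logs (a made-up example; the IP address, date, and byte count here are invented), each crawler's request for the missing file shows up as a 404 along these lines in an Apache access log:

    1.2.3.4 - - [24/Nov/2005:10:15:32 +0000] "GET /robots.txt HTTP/1.1" 404 209

    Multiply that by every visit from every bot and the noise adds up, even though nothing is actually wrong.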
     
    Dejavu, Nov 24, 2005 IP
  8. Dekker

    Dekker Peon

    Messages:
    4,185
    Likes Received:
    287
    Best Answers:
    0
    Trophy Points:
    0
    #8
    ohhh... I've been wondering what that is.

    You can then just allow all robots access to the / folder, which will give full access to everything.
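    To spell that out (a minimal sketch of the allow-all file; an empty Disallow value blocks nothing):

    User-agent: *
    Disallow:

    A completely empty robots.txt, or no file at all, has the same effect, but serving this version also keeps those 404s out of your error logs.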
     
    Dekker, Nov 24, 2005 IP
  9. Edz

    Edz Peon

    Messages:
    1,690
    Likes Received:
    72
    Best Answers:
    0
    Trophy Points:
    0
    #9
    Never thought this could also cause errors? A bit confused on this one:confused:

    "Ideal" is asking a bit much, I guess;) because what's ideal would vary from person to person, but something close to ideal, or what to look out for, is very welcome information for everyone wondering about this subject;)

    Come on guys, let's make a sticky about this:)
     
    Edz, Nov 24, 2005 IP
  10. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #10
    There are some directories you don't want crawled because you don't want the spiders wasting time in there and you don't want the SEs indexing stuff people are never going to see or stuff that has no content.

    A forum is a good case in point: You want to restrict certain files and directories to concentrate the spiders on the ones that matter. Mine looks like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /media/
    Disallow: /misc/
    Disallow: /stats/
    Disallow: /phpbb/admin/ 
    Disallow: /phpbb/db/ 
    Disallow: /phpbb/images/ 
    Disallow: /phpbb/includes/ 
    Disallow: /phpbb/language/ 
    Disallow: /phpbb/profile.php 
    Disallow: /phpbb/groupcp.php 
    Disallow: /phpbb/memberlist.php 
    Disallow: /phpbb/login.php 
    Disallow: /phpbb/modcp.php 
    Disallow: /phpbb/posting.php 
    Disallow: /phpbb/privmsg.php 
    Disallow: /phpbb/search.php 
    
     
    minstrel, Nov 24, 2005 IP
  11. amitpatel_3001

    amitpatel_3001 Results Follow Patience

    Messages:
    14,074
    Likes Received:
    1,178
    Best Answers:
    0
    Trophy Points:
    430
    #11
    For this site, I am using this:

    User-agent: *
    Disallow: /images/

    Is this enough if I want all the bots to see my site and get every page indexed?
     
    amitpatel_3001, Nov 24, 2005 IP
  12. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #12
    That will allow spidering of everything except what is in your /images folder.
     
    minstrel, Nov 24, 2005 IP
  13. amitpatel_3001

    amitpatel_3001 Results Follow Patience

    Messages:
    14,074
    Likes Received:
    1,178
    Best Answers:
    0
    Trophy Points:
    430
    #13
    So, this means all the search engines will index my pages regularly?
    Which is what I need.
     
    amitpatel_3001, Nov 24, 2005 IP
  14. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #14
    Probably.

    What it actually means is that there is nothing in your robots.txt file that blocks spiders from anything except your images folder.

    It's always possible there is something else about your site or your hosting that may be a problem but the robots.txt file is fine.
     
    minstrel, Nov 25, 2005 IP
  15. Edz

    Edz Peon

    Messages:
    1,690
    Likes Received:
    72
    Best Answers:
    0
    Trophy Points:
    0
    #15
    So, nobody would like to see a sticky about this, I presume?
     
    Edz, Nov 26, 2005 IP
  16. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #16
    When I was a moderator at WPW, we created a sticky on the robots.txt file. People still kept asking the questions - granted, we could refer them to the sticky, but then you still have to answer additional questions about things that aren't clear in the sticky, or about unique circumstances, or about issues that won't be fixed by a robots.txt file.
     
    minstrel, Nov 26, 2005 IP
  17. Will.Spencer

    Will.Spencer NetBuilder

    Messages:
    14,789
    Likes Received:
    1,040
    Best Answers:
    0
    Trophy Points:
    375
    #17
    Will.Spencer, Nov 26, 2005 IP
  18. Edz

    Edz Peon

    Messages:
    1,690
    Likes Received:
    72
    Best Answers:
    0
    Trophy Points:
    0
    #18
    Thanks Will, that's some good info over there. I put the site in my ''personal'' SEO toolbox, something I've been working on lately;)

    To be clear on something...

    I only have to remove the asterisk, right? And replace it with the user agents from the *bad bots* list on your site?

    # go away
    User-agent: *
    Disallow: /

    Or do I have to keep this line in place and start listing the bad robots underneath it?

    Also, can I make this file through WordPad?

    This method also sounds good, only I didn't grasp how to implement it by looking at the website.

    Also, even though questions will still be asked about the robots.txt file, I think a lot of future questions can be answered by planting a sticky on this subject.
    And questions still being asked even though they are fully explained in a sticky is probably a recurring problem that can't be avoided, I think.

    Questions that are still being asked can be valid, though, if the sticky doesn't fully cover the options for setting up a robots.txt file, and they will only improve the quality of such an information source.

    Also, since DP is getting more popular over time, queries made in search engines on this subject by beginning webmasters such as myself can increase the growth of the DP forums, because some of the results will refer to DP:)

    Don't know if Shawn is down for DP expanding even more, but if he is, it's a good opportunity, I guess.

    OK, enough of the sales pitch:D
     
    Edz, Nov 26, 2005 IP
  19. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #19
    Bear in mind that the bad bots will usually ignore robots.txt anyway, so it's sort of a waste of time adding them to your robots.txt file.

    Use Notepad instead, and make sure you're saving it as plain text (ANSI or ASCII).
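    For completeness, and as a sketch only ("BadBot" is a made-up name, not a real user agent): blocking specific bots by name while leaving everyone else alone means stacking one record per user agent, separated by blank lines, like so:

    # hypothetical bad bot - blocked from everything
    User-agent: BadBot
    Disallow: /

    # everyone else - full access
    User-agent: *
    Disallow:

    A crawler is supposed to obey the record whose User-agent best matches its own name, falling back to the * record; but, as noted, the genuinely bad ones don't read the file at all.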
     
    minstrel, Nov 26, 2005 IP
  20. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #20
    Will, there are a few odd ones in your list - I realize it came from funender, but he has given some dubious advice on other forums as well - you may want to edit that list.

    1. GetRight - this is an FTP download program (which I use myself for downloading updates, etc., on dial-up) - why would you block this?
    2. :confused: FrontPage - why would this even appear unless you yourself are using FrontPage?
    3. Xenu and Xenu Link Sleuth - this is a popular freeware link checker - again, I use it myself about once a month to scan for broken or redirected links on my site - if you ban this and someone who links to you uses Xenu, you stand a very good chance of losing backlinks.
    4. Why is larbin there?
    5. The lwp-trivial entries are not going to do anything - that's a string used by one of the forum-attacker worms, and it sure as hell isn't going to stop to read your robots.txt file.

    There are a few others there that are benign and, as I said to Edz, most of the actual bad bots listed there are not going to obey robots.txt directives anyway. Using robots.txt files like this is a waste of time, IMO.
     
    minstrel, Nov 26, 2005 IP