What is robots.txt and should I modify it?

Discussion in 'robots.txt' started by pr0xy122, Oct 25, 2006.

  1. #1
    What is robots.txt, and should I modify it? Is it something like the search engines' crawler? :confused:
     
    pr0xy122, Oct 25, 2006 IP
  2. Sean Man

    #2
    Some general reading:
    http://en.wikipedia.org/wiki/Robots.txt

    Essentially, it's a text file at the root of your domain that tells specific bots which directories of your website they may or may not access.

    For example:
    User-agent: *
    Disallow: /
    This disallows all bots from crawling any part of your site.


    User-agent: googlebot
    Disallow:
    
    User-agent: *
    Disallow: /
    This would allow Googlebot full access but disallow all other bots.
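
To check how a well-behaved crawler would read the two rule sets above, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The bot name `OtherBot` and the example.com URL are made up for illustration. The parser only reports what the rules say; it cannot force a bot to obey them.

```python
from urllib.robotparser import RobotFileParser

# First example from the post: every bot is kept out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Second example: googlebot gets an empty Disallow (nothing is blocked),
# every other bot falls through to the catch-all group.
GOOGLE_ONLY = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.com/some/page.html"  # hypothetical URL
    for name, rules in (("BLOCK_ALL", BLOCK_ALL), ("GOOGLE_ONLY", GOOGLE_ONLY)):
        for agent in ("googlebot", "OtherBot"):
            print(name, agent, allowed(rules, agent, url))
```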
     
    Sean Man, Nov 2, 2006 IP
  3. trichnosis

    #3
    trichnosis, Nov 14, 2006 IP
  4. michael11

    #4
    I have one question.

    Let's say I have hidden folders in my root directory that I don't want search engines to spider and index. For example, I sell an ebook or software that is stored in a directory, and I don't want it to turn up in search results.

    Some people are not dumb and would try www.mydomains.com/robots.txt, see which directories are hidden, and easily access the files I wanted to protect.

    What I do is store the thank-you/download page on another domain. Nonetheless, someone might blindly scan through hundreds or thousands of domains and get lucky enough to find valuable content somewhere.

    How do you protect against this? Can't you set it up so that someone who types the robots.txt URL directly into the address bar either gets an error page or is redirected somewhere else?

    Thanks for your input and ideas!
     
    michael11, Nov 25, 2006 IP
  5. michaelp

    #5
    I've been wondering about this too actually.

    One thing you can do to protect your disallowed folder is to make sure that anyone who types in mydomain.com/myhiddenfolder/ won't see the contents of the directory. This can be done either by placing an empty index.html file in it or by configuring the server to disallow directory listings entirely.

    You may want to take a look at .htaccess files for that; on Apache, the Options -Indexes directive turns directory listings off.

    To protect your ebooks and whatnot, you may want to look into third-party PHP/ASP scripts. There should be some that can secure your content.

    Another possibility for protecting the robots.txt file is to have files with the .txt extension resolve to a PHP/Perl script, similar to how some sites have .html files parsed as PHP. Then you can look at the user-agent: if it's a bot, display robots.txt; otherwise, display something else.

    ...But I'm not sure whether Google would consider that cloaking.

    I hope this helps,
    Michael
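
As a rough idea of what a content-protecting script might do, here is a minimal Python sketch of a signed, expiring download token. It is not taken from any particular third-party product; the names `SECRET_KEY`, `make_token`, and `is_valid` are hypothetical, and a real setup would also need to keep the files outside the publicly served directory.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical secret; in practice, load it from configuration kept
# outside the web root.
SECRET_KEY = b"change-me"


def make_token(filename: str, expires_at: int) -> str:
    """Sign the file name together with its expiry time (Unix seconds)."""
    message = f"{filename}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def is_valid(filename: str, expires_at: int, token: str,
             now: Optional[float] = None) -> bool:
    """Accept the download only if the link is unexpired and untampered."""
    if now is None:
        now = time.time()
    if now > expires_at:
        return False
    expected = make_token(filename, expires_at)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, token)


if __name__ == "__main__":
    expires = int(time.time()) + 3600  # link is good for one hour
    token = make_token("ebook.pdf", expires)
    print(is_valid("ebook.pdf", expires, token))   # freshly issued link
    print(is_valid("other.pdf", expires, token))   # token reused for another file
```

With a check like this in front of the download, it matters much less whether someone guesses the directory name from robots.txt, because the file is only served to a request carrying a valid token.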
     
    michaelp, Nov 26, 2006 IP
  6. swollenpickles

    #6
    So is this what you would do if you had an images directory that you didn't want Google to look at individually?
     
    swollenpickles, Nov 26, 2006 IP
  7. michaelp

    #7
    Yes, exactly.

    You would have a "Disallow: /images/" line in your robots.txt.

    ~Michael
     
    michaelp, Nov 27, 2006 IP
  8. eTIME

    #8
    robots.txt is for allowing or disallowing bots and spiders access to folders or files.
     
    eTIME, Jan 29, 2007 IP
  9. jonbt

    #9
    When I want to block certain files or directories with robots.txt but don't want others to see the full names, I use a partial path or a wildcard in robots.txt.

    From Google:

    To block all URLs that contain a ?:
    Disallow: /*?*

    From me:

    To block /admin/login.php, either of these works:
    Disallow: /adm
    Disallow: /adm*

    To block all files in the /images/ directory:
    Disallow: /imag
    Disallow: /imag*
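
To see what these partial-path rules actually cover, here is a simplified Python sketch of Google-style pattern matching: rules match from the start of the path, `*` matches any run of characters, and a trailing `$` anchors the end. The function name `rule_matches` and the sample paths are made up for illustration, and the sketch ignores details such as percent-encoding and Allow/Disallow precedence. Wildcards were not part of the original robots.txt convention, so not every bot may understand them.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern covers the given URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives prefix matching.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    checks = [
        ("/adm", "/admin/login.php"),
        ("/adm*", "/admin/login.php"),   # trailing * adds nothing to a prefix match
        ("/adm", "/admissions.html"),    # side effect: unrelated page is blocked too
        ("/imag", "/images/logo.png"),
        ("/*?*", "/page.php?id=1"),
        ("/*?*", "/page.php"),
    ]
    for pattern, path in checks:
        print(f"Disallow: {pattern:<6} {path:<20} {rule_matches(pattern, path)}")
```

Because matching is by prefix, /adm and /adm* behave the same, and a short prefix also catches any other path that happens to start with the same letters.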
     
    jonbt, Jan 30, 2007 IP