What is Robots.txt?

Discussion in 'Google Sitemaps' started by mpreyesmr, Jul 26, 2009.

Thread Status:
Not open for further replies.
  1. #1
    What is robots.txt in relation to a sitemap? What exactly is its function?
     
    mpreyesmr, Jul 26, 2009 IP
  2. blue_angel (Well-Known Member)
    #2
    Robots, including search indexing tools and intelligent agents, should check a special file in the root of each server called robots.txt, which is a plain text file (not HTML). Robots.txt implements the REP (Robots Exclusion Protocol), which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. Web administrators can Allow access to their web content and Disallow access to cgi, private and temporary directories, for example, if they do not want pages in those areas indexed.
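
    For example, a minimal robots.txt along those lines might look like this (the directory names here are just placeholders):

    User-agent: *
    # keep robots out of script, private and temporary areas
    Disallow: /cgi-bin/
    Disallow: /private/
    Disallow: /tmp/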
     
    blue_angel, Jul 27, 2009 IP
  3. blue_angel (Well-Known Member)
    #3
    About the Robots.txt file

    The robots.txt file is divided into sections by the robot crawler's User Agent name. Each section includes the name of the user agent (robot) and the paths it may not follow. You should remember that robots may access any directory path in a URL which is not explicitly disallowed in this file: every path not forbidden is allowed.
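    For instance, a file with separate sections for two user agents could look like this (Googlebot is a real crawler name; the paths are made up):

    User-agent: Googlebot
    Disallow: /drafts/

    User-agent: *
    Disallow: /drafts/
    Disallow: /cgi-bin/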

    Note that disallowing robots is not the same as creating a secure area in your site, as only honorable robots will obey the directives and there are plenty of dishonorable ones. Anything you do not want to show to the entire World Wide Web, you should protect with at least a password.

    You can usually read this file by simply requesting it from the server in a browser (for example, www.searchtools.com/robots.txt). If you open that link, you'll see that it's a text file with many entries, which I generated by looking at my server's error reports, because I wanted to keep robots from requesting those paths even occasionally.

    The older version is documented in the REP (Robots Exclusion Protocol), and all robots should recognize and honor the rules in the robots.txt file. The newer 2008 REP has additional features and may not be recognized by all robot crawlers.
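
    As I understand it, the 2008 extensions that the major engines announced support for include the Allow directive and wildcard patterns; a rough sketch (the paths are placeholders):

    User-agent: *
    # Allow and wildcard patterns (* and $) are newer-REP features
    Allow: /articles/
    Disallow: /*.pdf$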


    more here
    http://www.searchtools.com/robots/robots-txt.html
     
    blue_angel, Jul 27, 2009 IP
  4. freepost (Peon)
    #4
    Permissions for search engines look like this:
    User-Agent: *
    Allow: /
    Disallow: /admin/
    Disallow: /other-folder-or-pages/
     
    freepost, Aug 1, 2009 IP
  5. MaxPowers (Well-Known Member)
    #5
    The part that deals with sitemaps allows you to link to your sitemap (even if it's hosted on another domain) from within your robots.txt file.

    The syntax for this is the following line, on its own line; replace the URL with the actual URL of your sitemap...

    Sitemap: http://www.example.com/sitemap.xml

    This will tell the search engines where they can find your XML sitemap.
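
    For example, a complete robots.txt with a sitemap reference might look like this (example.com and the paths are placeholders, and you can list more than one Sitemap line if you have several):

    User-agent: *
    Disallow: /admin/

    Sitemap: http://www.example.com/sitemap.xml
    Sitemap: http://www.example.com/news-sitemap.xml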
     
    MaxPowers, Aug 1, 2009 IP
  6. wayneonweb (Peon)
    #6
    robots.txt, otherwise known as "The Robots Exclusion Protocol", is simply a way of giving instructions about your site to web robots. You create a robots.txt file with all your Allow and Disallow paths, and it needs to be placed in the root of your web server (http://www.example.com/robots.txt):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/

    The above example excludes all robots from the two listed folders.
     
    wayneonweb, Aug 5, 2009 IP