What Is Robots.txt?

Discussion in 'robots.txt' started by VijayanKumar, Jun 9, 2011.

  1. #1
    Robots.txt

    It is great when search engines frequently visit your site and index your content, but there are often cases when indexing parts of your online content is not what you want. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you would rather have the printing version excluded from crawling; otherwise you risk incurring a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will prefer that search engines do not index these pages (although in this case the only sure way to keep sensitive data out of the index is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and JavaScript from indexing, you need a way to tell spiders to keep away from these items.

    One way to tell search engines which files and folders on your Web site to avoid is the Robots metatag. But since not all search engines read metatags, the Robots metatag can simply go unnoticed. A better way to inform search engines of your wishes is to use a robots.txt file.

    What Is Robots.txt?

    Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means binding on search engines, but generally search engines obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (i.e. it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note “Please, do not enter” on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why we say that if you have really sensitive data, it is too naïve to rely on robots.txt to protect it from being indexed and displayed in search results.

    The location of robots.txt is very important. It must be in the main (root) directory of the site, because otherwise user agents (search engines) will not be able to find it; they do not search the whole site for a file named robots.txt. Instead, they look only in the main directory, and if they don't find it there, they simply assume that this site does not have a robots.txt file and therefore index everything they find along the way. So, if you don't put robots.txt in the right place, do not be surprised that search engines index your whole site.
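
    To make the location rule concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt, given any page URL on the site. The function name robots_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL; only the scheme and host matter.
print(robots_url("http://www.example.com/blog/2011/06/post.html"))
```

    Whatever page you start from, the result always points at the root of the host, which is why a robots.txt placed in a subdirectory is never found.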

    Structure of a Robots.txt File

    The structure of a robots.txt file is pretty simple (and barely flexible): it is a list, as long as you need it to be, of user agents and disallowed files and directories. Basically, the syntax is as follows:

    User-agent:
    Disallow:

    “User-agent:” names the search engine crawler that a record applies to, and “Disallow:” lists the files and directories to be excluded from indexing. In addition to “User-agent:” and “Disallow:” entries, you can include comment lines; just put the # sign at the beginning of the line:

    # All user agents are disallowed to see the /temp directory.
    User-agent: *
    Disallow: /temp/
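
    If you want to see how a well-behaved robot reads such a file, Python's standard library ships a parser in urllib.robotparser. The sketch below feeds it the example above; the crawler name "ExampleBot" and the two paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# "ExampleBot" is a hypothetical crawler name; it falls under the * record.
blocked = parser.can_fetch("ExampleBot", "/temp/cache.html")
allowed = parser.can_fetch("ExampleBot", "/index.html")
print(blocked, allowed)
```

    Here can_fetch returns False for anything under /temp/ and True for the rest of the site.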

    The Traps of a Robots.txt File

    When you start making complicated files (i.e. you decide to allow different user agents access to different directories), problems can start if you do not pay special attention to the traps of a robots.txt file. Common mistakes include typos and contradictory directives. Typos are misspelled user agents or directories, missing colons after User-agent and Disallow, etc. Typos can be tricky to find, but in some cases validation tools help.

    The more serious problem is with logical errors. For instance:

    User-agent: *
    Disallow: /temp/

    User-agent: Googlebot
    Disallow: /images/
    Disallow: /temp/
    Disallow: /cgi-bin/

    The above example is from a robots.txt that allows all agents to access everything on the site except the /temp directory. Up to here it is fine, but later on there is another record that specifies more restrictive terms for Googlebot. The trap is assuming that a crawler combines the two records, or reads them in the order you expect. Major crawlers such as Googlebot obey only the most specific record that matches them, so here Googlebot follows its own record and ignores the * record entirely; that is why /temp/ has to be repeated under Googlebot. A simpler robot that stops at the first record matching it, however, could see that all user agents are allowed everywhere except /temp/ and never reach the stricter rules, so it would index /images/ and /cgi-bin/, which you think you have told it not to touch. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
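
    Crawlers differ in how they resolve overlapping records, so it is worth testing rather than guessing. As one data point, the sketch below runs the example through Python's standard-library parser, which checks named records before falling back to the * record. Other robots may behave differently; "OtherBot" and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
"""

# Hypothetical paths, one per directory mentioned in the file.
PATHS = ["/images/logo.png", "/temp/a.html", "/cgi-bin/form", "/index.html"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def report(agent: str) -> dict:
    """Map each test path to whether this parser lets `agent` fetch it."""
    return {path: parser.can_fetch(agent, path) for path in PATHS}


googlebot = report("Googlebot")   # matched by its own record
other = report("OtherBot")        # hypothetical name, falls back to *
print(googlebot)
print(other)
```

    With this parser, Googlebot is kept out of all three directories, while any other robot is kept out of /temp/ only.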

    Tools to Generate and Validate a Robots.txt File

    Having in mind the simple syntax of a robots.txt file, you can always read it to see if everything is OK, but it is much easier to use a validator. For instance, a validator will flag a typo like this one:

    User agent: *
    Disallow: /temp/

    This is wrong because there is no hyphen between “user” and “agent”, so the syntax is incorrect.
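
    A very rough idea of what such a validator does can be sketched in a few lines of Python. This is not how any particular validation tool works; the function name find_syntax_errors and the list of accepted field names (which adds the common Allow and Sitemap fields to the two described above) are assumptions for illustration.

```python
from typing import List

# Assumed set of accepted field names; real validators may accept more.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}


def find_syntax_errors(text: str) -> List[str]:
    """Return one message per line that is not 'known-field: value'."""
    errors = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, colon, _value = line.partition(":")
        if not colon:
            errors.append(f"line {number}: missing colon")
        elif field.strip().lower() not in KNOWN_FIELDS:
            errors.append(f"line {number}: unknown field {field.strip()!r}")
    return errors


print(find_syntax_errors("User agent: *\nDisallow: /temp/\n"))
```

    The misspelled "User agent" line is reported, while the Disallow line passes. A check like this only catches typos; it cannot tell you whether the rules do what you meant.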

    When you have a complex robots.txt file (i.e. you give different instructions to different user agents, or you have a long list of directories and subdirectories to exclude), writing the file manually can be a real pain. But do not worry: there are tools that will generate the file for you. What is more, there are visual tools that let you point and click to select which files and folders are to be excluded. And even if you do not feel like buying a graphical tool for robots.txt generation, there are online tools to assist you. For instance, the Server-Side Robots Generator offers a dropdown list of user agents and a text box for you to list the files you don't want indexed. Honestly, it is not much help unless you want to set specific rules for different search engines, because in any case it is up to you to type the list of directories, but it is better than nothing.
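
    Generating the file yourself is not hard either. Below is a minimal sketch, not the method any of the tools above use, that turns a mapping of user agents to excluded paths into robots.txt text. The function name build_robots_txt is made up for illustration.

```python
from typing import Dict, List


def build_robots_txt(rules: Dict[str, List[str]]) -> str:
    """Build robots.txt text from {user agent: [disallowed paths]}."""
    records = []
    for agent, paths in rules.items():
        lines = [f"User-agent: {agent}"]
        # An empty Disallow value means nothing is excluded for this agent.
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        records.append("\n".join(lines))
    # Records are separated from each other by one blank line.
    return "\n\n".join(records) + "\n"


print(build_robots_txt({
    "*": ["/temp/"],
    "Googlebot": ["/images/", "/temp/", "/cgi-bin/"],
}))
```

    You still have to type the list of directories, but the colons, hyphens and record layout come out the same every time.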
     
    VijayanKumar, Jun 9, 2011 IP
  2. apotohosting

    #2
    Thanks for sharing a useful post :)
     
    apotohosting, Jun 10, 2011 IP
  3. jassicacute

    #3
    Thanks for sharing this about robots.txt. I was not aware of the multiple uses of robots.txt.
     
    jassicacute, Jun 10, 2011 IP
  4. sonoko125

    #4
    In SEO, robots.txt is the way to control Googlebot. It makes your SEO campaign more effective.
     
    sonoko125, Jun 11, 2011 IP
  5. sunl

    #5
    I have a question. I set up a blog and installed a plugin that generates a sitemap. The plugin changed my robots.txt file to this:

    User-agent: *
    Disallow:

    Sitemap: http://******.com/sitemap.xml.gz

    Now I want to know: will Google crawl my blog, and will it follow the sitemap URL? As you can see, the file has a Disallow line for all user agents.
     
    sunl, Jun 14, 2011 IP
  6. bobbydeo

    #6
    Google will crawl your site because the Disallow value is empty, which means nothing is excluded. It will also follow the sitemap URL.
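
    You can check what the file permits with Python's standard-library parser. The sketch below assumes Python 3.8 or later (for site_maps), and example.com stands in for the masked domain in the question. It only shows what the file allows; whether and when Google actually fetches the sitemap is up to Google.

```python
from urllib.robotparser import RobotFileParser

# example.com is a placeholder for the masked domain in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://example.com/sitemap.xml.gz
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical blog post path; an empty Disallow excludes nothing.
can_crawl = parser.can_fetch("Googlebot", "/2011/06/some-post/")
sitemaps = parser.site_maps()  # Sitemap lines found in the file
print(can_crawl, sitemaps)
```

    The parser treats the empty Disallow as "allow everything" and reports the sitemap URL listed in the file.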
     
    bobbydeo, Jun 22, 2011 IP
  7. upendraets

    #7
    It is a text file that tells Google and other search engines which parts of our site they may crawl.
    We can also disallow any page we do not want crawled. Robots.txt should be added to the root of the site.
     
    upendraets, Jun 25, 2011 IP
  8. tazseo

    #8
    Thanks for sharing the information.
     
    tazseo, Jun 28, 2011 IP
  9. t0l

    #9
    Using robots.txt to stop Google is not a very good idea.
    I suggest putting only the pages you need on your site. This would make your site safe and keep hackers away from it.....
     
    t0l, Jul 5, 2011 IP
  10. endu202

    #10
    Nice post. Very educational.
     
    endu202, Jul 7, 2011 IP
  11. VijayanKumar

    #11
    Thanks to everyone...
     
    VijayanKumar, Jul 8, 2011 IP
  12. VijayanKumar

    #12
    You are right, but that approach is not preferable for dynamic content.
     
    VijayanKumar, Jul 8, 2011 IP
  13. VijayanKumar

    #13
    You are right, but that approach is not preferable for dynamic content.
     
    VijayanKumar, Jul 8, 2011 IP
  14. VijayanKumar

    #14
    Keep in touch with me, guys...
     
    VijayanKumar, Jul 8, 2011 IP
  15. VijayanKumar

    #15
    Thanks for your reply...
     
    VijayanKumar, Jul 8, 2011 IP
  16. rowza

    #16
    Thanks ma, Should be stickied
     
    rowza, Jul 9, 2011 IP
  17. cereseo

    #17
    Nice info about robots.txt.
    I was confused, and reading your thread cleared up my doubts about robots.txt.
     
    cereseo, Jul 13, 2011 IP
  18. Billy_Bowden

    #18
    Robots.txt is just a file you upload via FTP to control how search engines treat your pages.
     
    Billy_Bowden, Jul 15, 2011 IP
  19. badcarbon1

    #19
    Thanks. Useful info.
     
    badcarbon1, Jul 18, 2011 IP
  20. kangtj

    #20
    great post and nice information :)
     
    kangtj, Jul 18, 2011 IP