A site map (or sitemap) is a list of the pages of a web site accessible to crawlers or users. It can be either a document in any form used as a planning tool for web design, or a web page that lists the pages of a web site, typically organized in hierarchical fashion.

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. The line "User-agent: *" means the section applies to all robots, and "Disallow: /" tells the robot that it should not visit any pages on the site.

Both files usually live at the root of your site, e.g. site.com/sitemap.xml and site.com/robots.txt. Some platforms, such as WordPress, can generate a "virtual" robots.txt file. Now you know.
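To make that concrete, here is a minimal sketch of a robots.txt that allows crawling and also points robots at the sitemap (the site.com URL is just a placeholder; the Sitemap: line is an optional directive supported by the major search engines):

    # Allow all robots to crawl the whole site
    # (an empty Disallow allows everything; "Disallow: /" would instead block the entire site)
    User-agent: *
    Disallow:

    # Optional: tell crawlers where the sitemap lives
    Sitemap: http://site.com/sitemap.xml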
robots.txt and sitemap.xml are located in your root folder. robots.txt tells the Google, Yahoo, and Bing crawlers what not to crawl (take an inventory of) on your site, and it works in conjunction with your sitemap.xml file. The sitemap.xml tells the crawlers which pages you have and where they are located. If you do not have a robots.txt file, the crawler crawls your entire site. If your robots.txt restricts page A and you also declare page A in your sitemap.xml, you get a crawl error in your Google Webmaster Tools panel.

An example of how we use it: we have a test directory on our server where we upload sites to be tested before they go public. Obviously, we do not want the crawlers to take note of (inventory) this directory, since it is for testing only. Below are the contents of our robots.txt file:

    # www.robotstxt.org/
    User-agent: *
    Disallow: /testing/

Good luck Tuhin.
Braulio
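For reference, a bare-bones sitemap.xml for a hypothetical site might look roughly like this; note that nothing under /testing/ is listed, since that directory is disallowed in robots.txt and listing it would trigger the crawl error mentioned above (the example.com URLs are placeholders, not from Braulio's setup):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <url> entry per public page; /testing/ pages are deliberately omitted -->
      <url>
        <loc>http://www.example.com/</loc>
      </url>
      <url>
        <loc>http://www.example.com/contact.html</loc>
      </url>
    </urlset>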