Hi,

Is writing robots.txt in the following format OK or not? Please guide.

User-agent: *
Disallow: /cgi-bin/
Disallow:

If I disallow the following crawlers (bandwidth-eating crawlers) in my robots.txt, will it affect my crawling?

User-agent: Flashget
Disallow: /

User-agent: Offline
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: Downloader
Disallow: /

User-agent: reaper
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: Website Quester
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: FAST-WebCrawler
Disallow: /

User-agent: Gulliver
Disallow: /

User-agent: WebCapture
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Fetch API Request
Disallow: /

User-agent: NetAnts
Disallow: /

User-agent: SuperBot
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: Wget
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: MSProxy/2.0
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: webbandit
Disallow: /

User-agent: MS FrontPage
Disallow: /
You will only stop those bad crawlers if they bother to check your robots.txt file. For example, I could still use 'wget' on your site to download every webpage to my computer. If you are really worried about it, then look into using a .htaccess file to block those user agents, along the lines of the sketch below.

Cryo
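To give an idea of what that could look like, here is a minimal .htaccess sketch, assuming an Apache server with mod_setenvif and the 2.2-style Order/Deny access control available; the user-agent strings are only a few examples taken from the list above:

# Flag requests whose User-Agent matches known download tools (example names only)
BrowserMatchNoCase "HTTrack" bad_bot
BrowserMatchNoCase "WebZIP" bad_bot
BrowserMatchNoCase "Wget" bad_bot

# Allow everyone else, refuse the flagged requests
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Code (markup):

Unlike robots.txt, this is enforced by the server, so it still works when a crawler ignores robots.txt (keeping in mind that user-agent strings can be faked).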
Hi,

User-agent: *
Disallow: /cgi-bin/
Disallow:
Code (markup):

This is not correct. An empty "Disallow:" allows access to all URLs. If you use it, you should not disallow anything else within the same group of directives.

User-agent: *
Disallow: /cgi-bin/
Code (markup):

This is the correct way to allow access to all URLs except the ones starting with /cgi-bin/.

Jean-Luc
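As for the second part of the original question, blocking the download tools will not affect normal crawling, because a robot obeys only the group that names it (or the catch-all * group if none does). A short sketch combining both ideas, using just one name from that list as an example:

User-agent: Wget
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Code (markup):

Here Wget is asked to stay away from the whole site, while every other robot is only kept out of /cgi-bin/.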
My robots.txt is like this:

User-agent: *
User-agent: Googlebot-Image
User-Agent: Googlebot
User-agent: Mediapartners-Google/2.1
User-agent: Mediapartners-Google*
User-agent: MSNBot
User-agent: msnbot-NewsBlogs
User-agent: Slurp
User-agent: yahoo-mmcrawler
User-agent: yahoo-blogs/v3.9
User-agent: Gigabot
User-agent: ia_archiver
User-agent: BotRightHere
User-agent: larbin
User-agent: b2w/0.1
User-agent: Copernic
User-agent: psbot
User-agent: Python-urllib
User-agent: NetMechanic
User-agent: URL_Spider_Pro
User-agent: CherryPicker
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: WebBandit
User-agent: EmailWolf
User-agent: ExtractorPro
User-agent: CopyRightCheck
User-agent: Crescent
User-agent: SiteSnagger
User-agent: ProWebWalker
User-agent: CheeseBot
User-agent: LNSpiderguy
User-agent: Alexibot
User-agent: Teleport
User-agent: TeleportPro
User-agent: MIIxpc
User-agent: Telesoft
User-agent: Website Quester
User-agent: WebZip
User-agent: moget/2.1
User-agent: WebZip/4.0
User-agent: WebStripper
User-agent: WebSauger
User-agent: WebCopier
User-agent: NetAnts
User-agent: Mister PiX
User-agent: WebAuto
User-agent: TheNomad
User-agent: WWW-Collector-E
User-agent: RMA
User-agent: libWeb/clsHTTP
User-agent: asterias
User-agent: httplib
User-agent: turingos
User-agent: spanner
User-agent: InfoNaviRobot
User-agent: Harvest/1.5
User-agent: Bullseye/1.0
User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: CherryPickerSE/1.0
User-agent: CherryPickerElite/1.0
User-agent: WebBandit/3.50
User-agent: NICErsPRO
User-agent: DittoSpyder
User-agent: Foobot
User-agent: SpankBot
User-agent: BotALot
User-agent: lwp-trivial/1.34
User-agent: lwp-trivial
User-agent: BunnySlippers
User-agent: URLy Warning
User-agent: Wget/1.6
User-agent: Wget/1.5.3
User-agent: Wget
User-agent: LinkWalker
User-agent: cosmos
User-agent: moget
User-agent: hloader
User-agent: humanlinks
User-agent: LinkextractorPro
User-agent: Offline Explorer
User-agent: Mata Hari
User-agent: LexiBot
User-agent: Web Image Collector
User-agent: The Intraformant
User-agent: True_Robot/1.0
User-agent: True_Robot
User-agent: BlowFish/1.0
User-agent: JennyBot
User-agent: MIIxpc/4.2
User-agent: BuiltBotTough
User-agent: ProPowerBot/2.14
User-agent: BackDoorBot/1.0
User-agent: toCrawl/UrlDispatcher
User-agent: suzuran
User-agent: TightTwatBot
User-agent: VCI WebViewer VCI WebViewer Win32
User-agent: VCI
User-agent: Szukacz/1.4
User-agent: Openfind data gatherer
User-agent: Openfind
User-agent: Xenu's Link Sleuth 1.1c
User-agent: Xenu's
User-agent: Zeus
User-agent: RepoMonkey Bait & Tackle/v1.01
User-agent: RepoMonkey
User-agent: Openbot
User-agent: URL Control
User-agent: Zeus Link Scout
User-agent: Zeus 32297 Webster Pro V2.9 Win32
User-agent: Webster Pro
User-agent: EroCrawler
User-agent: LinkScan/8.1a Unix
User-agent: Keyword Density/0.9
User-agent: Kenjin Spider
User-agent: Iron33/1.0.2
User-agent: Bookmark search tool
User-agent: GetRight/4.2
User-agent: FairAd Client
User-agent: Gaisbot
User-agent: Aqua_Products
User-agent: Radiation Retriever 1.1
User-agent: Flaming AttackBot
User-agent: Curl
User-agent: Web Reaper
User-agent: Firefox
User-agent: Opera
User-agent: Netscape
User-agent: WebVulnCrawl
User-agent: WebVulnScan
Disallow: /
Code (markup):

However, the above is not to be placed in the root directory, but in each of the directories you don't want to be crawled, in addition to a .htaccess file like this:

order deny,allow
deny from all
Code (markup):

For the root robots.txt, it's advisable not to disclose which directories you are trying to protect, because anyone can find them out just by pointing their browser to www.your_domain.ext/robots.txt
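A side note on that .htaccess snippet: the Order/Deny directives are Apache 2.2 syntax, and on Apache 2.4 they only work through mod_access_compat. A sketch of the native 2.4 equivalent, assuming an Apache 2.4 server:

Require all denied
Code (markup):

Either form blocks all HTTP access to the directory, for ordinary visitors as well as crawlers, so it only suits directories that should not be publicly readable at all.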
Robots only look at the robots.txt file in the root directory. If you place robots.txt files in other directories, no robot will look at them. On top of that, if your robots.txt file were placed in the root directory, it would disallow all robots everywhere on the site, exactly like this one would:

User-agent: *
Disallow: /
Code (markup):

Jean-Luc
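That happens because consecutive User-agent lines all share the Disallow lines that follow them, so the "User-agent: *" line pulls every robot into the "Disallow: /" record. A sketch of a safer shape for such a list in the root robots.txt, with only a few of the names shown and the catch-all moved into its own group:

User-agent: Wget
User-agent: WebCopier
User-agent: EmailSiphon
Disallow: /

User-agent: *
Disallow:
Code (markup):

The second group is optional; robots that are not named anywhere are allowed everywhere by default.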