Robots.txt

Discussion in 'Search Engine Optimization' started by elle19570, Oct 13, 2006.

  1. #1
    Hi,

    Is it OK to write robots.txt in the following format? Please guide me.

    User-agent: *
    Disallow: /cgi-bin/
    Disallow:


    If I disallow the following crawlers (bandwidth-eating crawlers) in my robots.txt, will it affect crawling of my site by search engines:

    User-agent: Flashget
    Disallow: /

    User-agent: Offline
    Disallow: /

    User-agent: Teleport
    Disallow: /

    User-agent: Downloader
    Disallow: /

    User-agent: reaper
    Disallow: /

    User-agent: WebZIP
    Disallow: /

    User-agent: Website Quester
    Disallow: /

    User-agent: MSIECrawler
    Disallow: /

    User-agent: FAST-WebCrawler
    Disallow: /

    User-agent: Gulliver
    Disallow: /

    User-agent: WebCapture
    Disallow: /

    User-agent: HTTrack
    Disallow: /

    User-agent: Fetch API Request
    Disallow: /

    User-agent: NetAnts
    Disallow: /

    User-agent: SuperBot
    Disallow: /

    User-agent: WebCopier
    Disallow: /

    User-agent: WebStripper
    Disallow: /

    User-agent: Wget
    Disallow: /

    User-agent: EmailSiphon
    Disallow: /

    User-agent: MSProxy/2.0
    Disallow: /

    User-agent: EmailWolf
    Disallow: /

    User-agent: webbandit
    Disallow: /

    User-agent: MS FrontPage
    Disallow: /
     
    elle19570, Oct 13, 2006 IP
  2. Cryogenius
    #2
    You will only stop those bad crawlers if they bother to check your robots.txt file. For example, I could still use 'wget' on your site to download every webpage to my computer.

    If you are really worried about it, then look into using a .htaccess file to block those user agents.
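    For example, with Apache you could put something like this in .htaccess (just a sketch using mod_setenvif; the user-agent patterns are only examples, so match whatever actually shows up in your logs):

    # Flag requests whose User-Agent matches a known site ripper (example patterns)
    BrowserMatchNoCase "Wget" bad_bot
    BrowserMatchNoCase "HTTrack" bad_bot
    BrowserMatchNoCase "WebZIP" bad_bot
    BrowserMatchNoCase "Teleport" bad_bot

    # Deny flagged requests, allow everyone else (Apache 2.2 syntax)
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
    Code (markup):
    Unlike robots.txt, this is enforced by the server, so the crawler does not get to choose whether to obey it.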

    Cryo.
     
    Cryogenius, Oct 13, 2006 IP
  3. Jean-Luc
    #3
    Hi,
    User-agent: *
    Disallow: /cgi-bin/
    Disallow:
    Code (markup):
    This is not correct.

    An empty Disallow: line allows access to all URLs. If you use it, you should not disallow anything else within the same group of directives.

    User-agent: *
    Disallow: /cgi-bin/
    Code (markup):
    This is the correct way to allow access to all URLs except the ones starting with /cgi-bin/.
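    As for the bad crawlers: those per-bot groups will not affect normal search engine crawling, because a group like "User-agent: Wget" only applies to that crawler. A shortened sketch of the whole file (only two of the bad bots shown as examples, list the rest the same way):

    User-agent: *
    Disallow: /cgi-bin/

    User-agent: Wget
    User-agent: HTTrack
    Disallow: /
    Code (markup):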

    Jean-Luc
     
    Jean-Luc, Oct 13, 2006 IP
  4. Pat Gael
    #4
    My robots.txt is like this:

    
    User-agent: *
    User-agent: Googlebot-Image
    User-Agent: Googlebot
    User-agent: Mediapartners-Google/2.1
    User-agent: Mediapartners-Google*
    User-agent: MSNBot
    User-agent: msnbot-NewsBlogs
    User-agent: Slurp
    User-agent: yahoo-mmcrawler
    User-agent: yahoo-blogs/v3.9
    User-agent: Gigabot
    User-agent: ia_archiver
    User-agent: BotRightHere 
    User-agent: larbin 
    User-agent: b2w/0.1 
    User-agent: Copernic 
    User-agent: psbot 
    User-agent: Python-urllib 
    User-agent: NetMechanic 
    User-agent: URL_Spider_Pro 
    User-agent: CherryPicker 
    User-agent: EmailCollector 
    User-agent: EmailSiphon 
    User-agent: WebBandit 
    User-agent: EmailWolf 
    User-agent: ExtractorPro 
    User-agent: CopyRightCheck 
    User-agent: Crescent 
    User-agent: SiteSnagger 
    User-agent: ProWebWalker 
    User-agent: CheeseBot 
    User-agent: LNSpiderguy 
    User-agent: Alexibot 
    User-agent: Teleport 
    User-agent: TeleportPro 
    User-agent: MIIxpc 
    User-agent: Telesoft 
    User-agent: Website Quester 
    User-agent: WebZip 
    User-agent: moget/2.1 
    User-agent: WebZip/4.0 
    User-agent: WebStripper 
    User-agent: WebSauger 
    User-agent: WebCopier 
    User-agent: NetAnts 
    User-agent: Mister PiX 
    User-agent: WebAuto 
    User-agent: TheNomad 
    User-agent: WWW-Collector-E 
    User-agent: RMA 
    User-agent: libWeb/clsHTTP 
    User-agent: asterias 
    User-agent: httplib 
    User-agent: turingos 
    User-agent: spanner 
    User-agent: InfoNaviRobot 
    User-agent: Harvest/1.5 
    User-agent: Bullseye/1.0 
    User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95) 
    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0 
    User-agent: CherryPickerSE/1.0 
    User-agent: CherryPickerElite/1.0 
    User-agent: WebBandit/3.50 
    User-agent: NICErsPRO 
    User-agent: DittoSpyder 
    User-agent: Foobot 
    User-agent: SpankBot 
    User-agent: BotALot 
    User-agent: lwp-trivial/1.34 
    User-agent: lwp-trivial 
    User-agent: BunnySlippers 
    User-agent: URLy Warning 
    User-agent: Wget/1.6 
    User-agent: Wget/1.5.3 
    User-agent: Wget 
    User-agent: LinkWalker 
    User-agent: cosmos 
    User-agent: moget 
    User-agent: hloader 
    User-agent: humanlinks 
    User-agent: LinkextractorPro 
    User-agent: Offline Explorer 
    User-agent: Mata Hari 
    User-agent: LexiBot 
    User-agent: Web Image Collector 
    User-agent: The Intraformant 
    User-agent: True_Robot/1.0 
    User-agent: True_Robot 
    User-agent: BlowFish/1.0 
    User-agent: JennyBot 
    User-agent: MIIxpc/4.2 
    User-agent: BuiltBotTough 
    User-agent: ProPowerBot/2.14 
    User-agent: BackDoorBot/1.0 
    User-agent: toCrawl/UrlDispatcher 
    User-agent: suzuran 
    User-agent: TightTwatBot 
    User-agent: VCI WebViewer VCI WebViewer Win32 
    User-agent: VCI 
    User-agent: Szukacz/1.4 
    User-agent: Openfind data gatherer 
    User-agent: Openfind 
    User-agent: Xenu's Link Sleuth 1.1c 
    User-agent: Xenu's 
    User-agent: Zeus 
    User-agent: RepoMonkey Bait & Tackle/v1.01 
    User-agent: RepoMonkey 
    User-agent: Openbot 
    User-agent: URL Control 
    User-agent: Zeus Link Scout 
    User-agent: Zeus 32297 Webster Pro V2.9 Win32 
    User-agent: Webster Pro 
    User-agent: EroCrawler 
    User-agent: LinkScan/8.1a Unix 
    User-agent: Keyword Density/0.9 
    User-agent: Kenjin Spider 
    User-agent: Iron33/1.0.2 
    User-agent: Bookmark search tool 
    User-agent: GetRight/4.2 
    User-agent: FairAd Client 
    User-agent: Gaisbot 
    User-agent: Aqua_Products 
    User-agent: Radiation Retriever 1.1 
    User-agent: Flaming AttackBot 
    User-agent: Curl 
    User-agent: Web Reaper
    User-agent: Firefox
    User-agent: Opera
    User-agent: Netscape
    User-agent: WebVulnCrawl
    User-agent: WebVulnScan
    Disallow: /
    
    Code (markup):
    However, the above is not to be placed in the root directory, but in each of the directories you don't want crawled, together with an .htaccess file like this:

    
    order deny,allow
    deny from all
    
    Code (markup):
    For the root robots.txt, it's advisable not to disclose which directories you are trying to protect, because anyone can find them just by pointing their browser to www.your_domain.ext/robots.txt.
     
    Pat Gael, Oct 13, 2006 IP
  5. Jean-Luc
    #5
    Robots only look at the robots.txt file in the root directory. If you place robots.txt files in other directories, no robot will look at these files.

    On top of that, if your robots.txt file were placed in the root directory, it would disallow all robots everywhere on the site, exactly like this one would:
    User-agent: *
    Disallow: /
    Code (markup):
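    If the intention is to let the good bots crawl normally and block only the bad ones, the bad ones need their own group with Disallow: /. A shortened sketch (only a few names shown; keep the rest of your list in the same groups, and any bot not matched by a group is allowed by default):

    # Search engine bots: allow everything
    User-agent: Googlebot
    User-agent: MSNBot
    User-agent: Slurp
    Disallow:

    # Site rippers and e-mail harvesters: block everything
    User-agent: WebZip
    User-agent: WebCopier
    User-agent: EmailSiphon
    Disallow: /
    Code (markup):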
    Jean-Luc
     
    Jean-Luc, Oct 13, 2006 IP
  6. elle19570
    #6
    Thank you very much for the guidance
     
    elle19570, Oct 16, 2006 IP