Hi, I've published a Robots.txt Guide on this SEO blog. It's a fairly comprehensive guide (at least I find it so) for people who want to tweak and edit their robots.txt file, and I've also included a list of user-agents for your use. A few questions for you:

- Do you use a robots.txt file?
- If yes, do you have a list of "bad bots" that you disallow?
- If yes, please share it with us!

After you read the guide, please let me know if you found it helpful, and feel free to comment on it. Thanks.

P.S. If you know of other user-agents that I missed, please let me know!
Yes, I use robots.txt, though good bots follow the rules and bad bots seldom or never do. I also use robots.txt for the newer sitemap autodiscovery supported by G/Y/(MSN?): I keep the line

Sitemap: /sitemapindex.xml

in my robots.txt and maintain that sitemap index, so I don't have to submit sitemaps or RSS feeds to the major SEs manually.
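For anyone copying that idea: the sitemaps.org protocol specifies a full URL for the Sitemap: directive, and the line sits outside any User-agent block. Written out, a minimal robots.txt of this kind might look roughly like the sketch below (example.com is just a placeholder for your own domain, and the Disallow lines are only examples of the usual restricted areas):

# robots.txt - sitemap autodiscovery plus a couple of typical exclusions
Sitemap: http://www.example.com/sitemapindex.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/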
Here is my robots.txt disallow list:

User-agent: e-SocietyRobot
Disallow: /

User-agent: psbot
Disallow: /

User-agent: yacybot
Disallow: /

User-agent: ConveraCrawler
Disallow: /

User-agent: MJ12bot
Disallow: /

I updated the list above just a few days ago with the following criteria in mind. All of the above show LARGE crawl activity, in the thousands of requests per month. I visited the homepage of each, studied its goal and purpose, and then made my final decision based on:

- If a bot pretends to crawl for a new SE that is already UP, makes many thousands of crawls per month, yet fails to provide even a single decent result for a major keyword of my site = disallowed.
- If a bot belongs to some obscure "ecommerce" or "society" project, or the use of its crawl/search results is limited to a restricted NON-public group with no open public use, and that "society" or project does NOT appear in my referrer URL list = disallowed.

User agents disallowed in my .htaccess are (a sketch of the syntax follows below):

Indy Library
ibwww-perl/5.79
WebImages
Wget
Offline Navigator
Xaldon WebSpider

The first of those is used on my site tens of thousands of times a month for massive NON-human page loads by Chinese networks all across CN, creating fake traffic totaling in excess of a hundred thousand pageviews/month on a few selected URLs only. Most of this CN-originating activity comes from a LARGE number (hundreds) of IPs, and for that reason the majority of it is blocked by iptables. To find those IPs, I grep my logs for the user agents on my .htaccess deny list, work out the IP ranges, and then create new iptables rules to ground entire CN networks (see the second sketch below).

The agent list above has been in use for about a year now. Every now and then (maybe monthly, if I have time) I remove the deny list and watch whether I get even the slightest increase in real human traffic - NONE so far, hence NO loss from blocking entire A or B class networks from CN.

The other agents have been used to mass-download images. Just days ago I found entire site sections of MFA (made-for-AdSense) and similar advertisement sites built ONLY from image content stolen from my site - worse still, every image hotlinked on the copyright infringer's site! Hence I strictly disallow any special image agent OTHER than the major SEs.

Site mirror tools like wget (I use it myself when I need to grab an open-source howto on some subject) have at times caused looping downloads of up to ten thousand requests within hours, so I block them, since I already offer a complete mirror of my content as a .zip download.

I am just now doing my annual inventory of hackers and copyright infringers (3 entire sites have been grounded in the last few days based on this "inventory") and will publish some more details and copy/paste material for .htaccess and/or robots.txt in my blog "secrets of love", section Internet/SEO.

How do you find out whether you are being abused by any of the above user agents or bots? Simply look at your access_log stats in detail, OR run zgrep over your logfiles (at least a week or a month of them) and see how many hits you get and whether you need any of the deny or exclusion rules above. Example zgrep usage: cd to the directory of your log files, then in bash:

zgrep "Indy Library" access_log-200611*.gz | wc -l
88358

88358 = the number (count) of occurrences of fake traffic originating from CN last Nov 2006. Last month it was:

zgrep "Indy Library" access_log-200710*.gz | wc -l
12440

The difference between last Nov and this Oct is blocked by my iptables; the remaining 12440 are caught by my .htaccess deny rules.
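For readers who want the actual syntax behind that .htaccess deny list: a minimal sketch of one common way to write it, assuming Apache with mod_setenvif enabled (the agent strings are simply the ones from my list above, matched as case-insensitive substrings; adapt to your own server setup):

# .htaccess - refuse requests whose User-Agent matches a listed bad bot
SetEnvIfNoCase User-Agent "Indy Library"      bad_bot
SetEnvIfNoCase User-Agent "ibwww-perl"        bad_bot
SetEnvIfNoCase User-Agent "WebImages"         bad_bot
SetEnvIfNoCase User-Agent "Wget"              bad_bot
SetEnvIfNoCase User-Agent "Offline Navigator" bad_bot
SetEnvIfNoCase User-Agent "Xaldon WebSpider"  bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot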
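And for the "find the IPs, then ground the whole network" step: a rough sketch of how it can be done, assuming the standard combined log format where the client IP is the first field. The 218.0.0.0/8 range in the iptables line is purely a placeholder example, not one of my actual rules - check your own log output before blocking anything.

# count hits per client IP for one bad user agent across gzipped logs
zcat access_log-200710*.gz | grep "Indy Library" | awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# then (as root) drop an entire offending network at the firewall - placeholder range only
iptables -A INPUT -s 218.0.0.0/8 -j DROP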
And since this thread is also indirectly about abuse of crawls and bandwidth: these days I am also taking inventory of my hotlink top list (world champion is once more myspace.com, with tens of thousands of hotlinked images per month, often full-size wallpapers). The details, and the exact bash commands to find this out, will soon be published in my blog for use on other people's servers/sites; a basic anti-hotlink rule is sketched below.

RE your POLL question "Does your host allow you to upload a robots.txt file?": it may well even be against common law to DISALLOW the use of robots.txt, since robots.txt is a globally practiced method to protect site content and bandwidth from abuse and to keep restricted areas such as admin areas or cgi-bin from being crawled. I have never heard of any host disallowing the use of robots.txt OR .htaccess, since BOTH are needed for the security, protection and proper functioning/operation of an entire site or parts thereof.
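On the hotlinking issue mentioned above, the usual mod_rewrite approach for .htaccess looks roughly like the sketch below - it refuses image requests whose referrer is another site while still allowing blank referrers and your own domain. example.com is a placeholder for your own domain, and it assumes mod_rewrite is available:

# .htaccess - block hotlinked images from other sites
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [NC,F]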
Thanks for that Hans. There are some hosts which do not allow robots.txt (some free hosts, I think)...
Free hosts ... who nowadays still uses free hosts, when regular quality hosting costs only a pittance compared to the AdSense revenue potential? And for all serious webmasters, robots.txt is a necessity to comply with G's procedures for having dead URLs removed, hence no professional use would be possible without both a robots.txt and an .htaccess file. If a host disallows robots.txt, it may be because they need to crawl entire sites for legitimate liability-protection reasons, to guard against site abuse with illegal content - but then again, a qualified host can do such self-checks much faster directly on the HDD rather than via the web.