
Robots.txt Guide

Discussion in 'robots.txt' started by Boardwalk, Oct 16, 2007.


Does your host allow you to upload a robots.txt file?

  1. Yes

    86.7%
  2. No

    0 vote(s)
    0.0%
  3. Umm... I don't know...

    13.3%
  1. #1
    Hi, I've published a post on this SEO blog: a Robots.txt Guide. It's a very comprehensive guide (well, that's how I find it) for people who want to tweak and edit their robots.txt file. I've also included a list of user-agents for your use. :)

    Here are some questions for you:
    Do you use a robots.txt file?
    - If yes, do you have a list of "bad bots" that you disallow?
    -- If yes, please share it with us. Thanks!

    After you read it, please let me know if you found it helpful... If you liked the guide, feel free to comment on it. Thanks.

    P.S. If you know of some other user-agents that I missed, please let me know!
     
    Boardwalk, Oct 16, 2007 IP
  2. hans

    hans Well-Known Member

    Messages:
    2,923
    Likes Received:
    126
    Best Answers:
    1
    Trophy Points:
    173
    #2
    Yes, I use robots.txt.

    However:
    good bots follow the rules,
    bad bots seldom or never do.

    I also use robots.txt for the newer Google/Yahoo/(MSN?) method of auto-discovering sitemaps / RSS feeds. Hence I have the line
    Sitemap: /sitemapindex.xml
    in my robots.txt, and I maintain that sitemap index to avoid submitting sitemaps or RSS feeds to the major search engines by hand.
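    For reference, a minimal robots.txt with sitemap auto-discovery might look like the sketch below. The domain is a placeholder; note that per the sitemaps.org protocol the Sitemap directive is expected to be a full absolute URL:

```
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.com/sitemapindex.xml
```

    Google, Yahoo and MSN/Live all read this directive, so listing the sitemap index here saves submitting it to each engine separately.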
     
    hans, Oct 21, 2007 IP
  3. Boardwalk

    Boardwalk Well-Known Member

    Messages:
    1,651
    Likes Received:
    44
    Best Answers:
    0
    Trophy Points:
    140
    #3
    Okay.. nice. :)
     
    Boardwalk, Oct 21, 2007 IP
  4. cooldude7273

    cooldude7273 Active Member

    Messages:
    185
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #4
    I was actually unaware that a host could disallow you from using a robots.txt file?
     
    cooldude7273, Nov 1, 2007 IP
  5. hans

    hans Well-Known Member

    Messages:
    2,923
    Likes Received:
    126
    Best Answers:
    1
    Trophy Points:
    173
    #5
    Here is my robots.txt disallow list:

    User-agent: e-SocietyRobot
    Disallow: /

    User-agent: psbot
    Disallow: /

    User-agent: yacybot
    Disallow: /

    User-agent: ConveraCrawler
    Disallow: /

    User-agent: MJ12bot
    Disallow: /

    I updated the list above just a few days ago, with the following criteria in mind:
    all of the above show LARGE crawl activity, in the thousands of requests per month. I visited the homepage of each, studied its goal and purpose, and then made my final decision based on:
    - if a bot claims to crawl for a new search engine that is already UP, makes many thousands of crawls per month, and fails to provide even a single decent result for a major keyword of my site: disallow.
    - if a bot serves an obscure "ecommerce" or "society" project whose search/crawl results are limited to a restricted NON-public group, without open public use, and that "society" or project does NOT appear in my referrer URL list: disallowed.

    The user agents denied in my .htaccess are:

    Indy Library
    libwww-perl/5.79
    WebImages
    Wget
    Offline Navigator
    Xaldon WebSpider
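    One common way to deny agents like these in .htaccess (a sketch only, assuming Apache 1.3/2.x with mod_setenvif; the agent strings are from the list above) is:

```
BrowserMatchNoCase "Indy Library" bad_bot
BrowserMatchNoCase "libwww-perl" bad_bot
BrowserMatchNoCase "WebImages" bad_bot
BrowserMatchNoCase "Wget" bad_bot
BrowserMatchNoCase "Offline Navigator" bad_bot
BrowserMatchNoCase "Xaldon WebSpider" bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

    On Apache 2.4+ the equivalent access control would use Require directives instead of Order/Allow/Deny.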

    The first of the above is used against my site tens of thousands of times a month for massive NON-human page loads by Chinese networks across all of CN, creating fake traffic in excess of a hundred thousand pageviews/month on just a few selected URLs.
    Most of these CN-originating activities come from a LARGE number (hundreds) of IPs, and for that reason the majority of them are blocked by iptables. To find these IPs, I go through the .htaccess deny hits for those user agents in my logs, extract the IP ranges, then create new iptables rules to ground entire CN networks.

    The agent list above has now been in use for about a year. Every now and then (maybe monthly, if I have time) I remove that deny list and watch whether there is even the slightest increase in real human traffic. NONE so far, hence there is NO loss in blocking entire class-A or class-B networks from CN.
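    To illustrate the log-mining step, here is a minimal self-contained sketch; the gzipped log, the IPs and the CIDR range are all synthetic examples, not data from this thread (real rotated logs would be named like access_log-200710.gz):

```shell
# Build a tiny synthetic gzipped log (stand-in for real access_log-YYYYMM*.gz files).
printf '%s\n' \
  '1.2.3.4 - - [01/Oct/2007] "GET /a HTTP/1.0" 200 512 "-" "Mozilla/3.0 (compatible; Indy Library)"' \
  '1.2.3.4 - - [01/Oct/2007] "GET /b HTTP/1.0" 200 512 "-" "Mozilla/3.0 (compatible; Indy Library)"' \
  '5.6.7.8 - - [01/Oct/2007] "GET /c HTTP/1.0" 200 512 "-" "Mozilla/5.0"' \
  | gzip > access_log-demo.gz

# Count hits from the offending agent (prints 2 for this demo log).
zgrep -c "Indy Library" access_log-demo.gz

# List source IPs by frequency, worst offender first.
zgrep "Indy Library" access_log-demo.gz | awk '{print $1}' | sort | uniq -c | sort -rn

# An entire range could then be grounded (example CIDR only, needs root):
#   iptables -A INPUT -s 1.2.0.0/16 -j DROP
```

    The uniq -c ranking makes it easy to spot whole networks worth dropping at the firewall instead of matching user agents request by request.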

    Other agents have been used to mass-download images. Just days ago I found entire site sections of made-for-advertising sites (MFA and others) built ONLY from image content stolen from my site; even worse, all the images were hotlinked on the copyright infringer's site :) !! Hence I strictly disallow any specialized image agent OTHER than those of the major search engines.
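    A standard hotlink-protection sketch for .htaccess (assuming Apache with mod_rewrite; example.com is a placeholder for your own domain):

```
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```

    Image requests carrying a foreign referer get a 403, while empty referers (browsers typing URLs directly, and most crawlers) still pass through.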

    Site-mirroring tools like wget (I use it myself when I need to grab an open-source howto on some subject) have sometimes caused looping downloads of up to tens of thousands of requests within hours, hence I block them, since I already offer a complete mirror of my content as a .zip download.

    I am just now making my annual inventory of hackers and copyright infringers (3 entire sites have been grounded in the last few days based on that "inventory") and will publish some more details and copy/paste material for .htaccess and/or robots.txt in my blog secrets of love, section Internet/SEO.

    How do you find out whether you are being abused by any of the above user agents or bots?
    Simply look at your access_log stats in detail,
    OR
    use zgrep on your log files (at least a week's or a month's worth of them) and see how many visits you get and whether you need any of the deny or exclusion rules above.
    Example usage of zgrep to search log files:

    cd to the directory of your log files, then in bash:

    zgrep "Indy Library" access_log-200611*.gz | wc -l
    88358

    88358 = the number (count) of occurrences of fake traffic originating from CN in November 2006. For last month it is:

    zgrep "Indy Library" access_log-200710*.gz | wc -l
    12440

    The difference between last November and this October is blocked by my iptables; the remaining 12440 are caught by my .htaccess deny rules.

    And since this thread is also indirectly about abuse of crawls, bandwidth, etc., I am currently taking inventory of my hotlink top list (the world champion is once more myspace.com, with tens of thousands of hotlinked images per month, often full-size wallpapers).

    The details, with the exact bash commands to find all this out, will soon be published in my blog for use on other people's servers/sites.


    RE your POLL question:
    Does your host allow you to upload a robots.txt file?

    It may even be legally questionable to DISALLOW the use of robots.txt, as robots.txt is a globally practiced method of protecting site content and bandwidth from abuse, and of keeping restricted areas such as admin areas or cgi-bin from being crawled.
    I have never heard of any host disallowing the use of robots.txt OR .htaccess, since BOTH are needed for the security, protection and proper functioning/operation of an entire site or parts thereof.
     
    hans, Nov 1, 2007 IP
  6. Boardwalk

    Boardwalk Well-Known Member

    Messages:
    1,651
    Likes Received:
    44
    Best Answers:
    0
    Trophy Points:
    140
    #6
    Thanks for that Hans.

    There are some hosts which do not allow robots.txt (some free hosts, I think)...
     
    Boardwalk, Nov 1, 2007 IP
  7. hans

    hans Well-Known Member

    Messages:
    2,923
    Likes Received:
    126
    Best Answers:
    1
    Trophy Points:
    173
    #7
    Free hosts...
    Who nowadays still uses free hosts, when regular quality hosting costs just a tip compared to the AdSense revenue potential?
    And for all serious webmasters, robots.txt is needed to comply with Google's procedures for having dead URLs removed;
    hence no professional use would be possible without both robots.txt and .htaccess files.
    If a host disallows robots.txt, it may be because they need to crawl entire sites for legitimate liability-protection reasons, to prevent site abuse through illegal content. But then again, a qualified host can do such self-checks much faster directly on the HDD rather than via the web.
     
    hans, Nov 1, 2007 IP