Robots.txt question

Discussion in 'robots.txt' started by mcsp, Jul 26, 2005.

  1. #1
    Hello. I understand what the robots.txt file is for, but I'm a bit lost on how to add one. My site gives me this message when I run it through a validator:

    We're sorry, this robots.txt does NOT validate.
    Warnings Detected: 391
    Errors Detected: 415


    First post here. Thanks for the help
     
    mcsp, Jul 26, 2005 IP
  2. frankm

    frankm Active Member

    Messages:
    915
    Likes Received:
    63
    Best Answers:
    0
    Trophy Points:
    83
    #2
    Hi mcsp -- and welcome to DP!

    you create a robots.txt just like any other file (such as your index.html) and upload it to the root directory of your website.

    you can check it with http://www.yourwebsitename.com/robots.txt

    if you do not want to see that error message, just create an empty file (0 bytes) and upload it as robots.txt
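    For what it's worth, a robots.txt that explicitly allows everything is only two lines (this is the standard allow-all form):

    User-agent: *
    Disallow:

    An empty Disallow value means "nothing is disallowed", so every compliant crawler may fetch the whole site. Changing that last line to Disallow: / does the opposite and asks all compliant bots to stay out.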
     
    frankm, Jul 26, 2005 IP
  3. mcsp

    mcsp Peon

    Messages:
    56
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Hey, thanks for the help! So it's not in the index page HTML? It's a .txt file in the root. Much less brain damage than I thought.

    Thanks again !
     
    mcsp, Jul 26, 2005 IP
  4. frankm

    frankm Active Member

    Messages:
    915
    Likes Received:
    63
    Best Answers:
    0
    Trophy Points:
    83
    #4
    yeah. it is just that simple :)

    /robots.txt
     
    frankm, Jul 26, 2005 IP
  5. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #5
    minstrel, Jul 26, 2005 IP
  6. mcsp

    mcsp Peon

    Messages:
    56
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Well I have to say I am impressed with the response. This looks like a great group here. Happy to have stumbled in. Been lurking for a year.

    Thanks again
     
    mcsp, Jul 26, 2005 IP
  8. gatordun

    gatordun Guest

    Messages:
    114
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Here is an example that we used before:

    # Robots.txt file from http://www.website.com
    #
    # Bans every agent listed below (including *, i.e. all robots)
    # from all text, images and graphics -- edit the list as needed
    #
    User-agent: *
    User-agent: alexa.com
    User-agent: archive.org
    User-agent: ia_archiver
    User-agent: Alexibot
    User-agent: Aqua_Products
    User-agent: BackDoorBot
    User-agent: BackDoorBot/1.0
    User-agent: Black.Hole
    User-agent: BlackWidow
    User-agent: BlowFish
    User-agent: BlowFish/1.0
    User-agent: Bookmark search tool
    User-agent: Bot mailto:craftbot@yahoo.com
    User-agent: BotALot
    User-agent: BotRightHere
    User-agent: BuiltBotTough
    User-agent: Bullseye
    User-agent: Bullseye/1.0
    User-agent: BunnySlippers
    User-agent: Cegbfeieh
    User-agent: CheeseBot
    User-agent: CherryPicker
    User-agent: CherryPickerElite/1.0
    User-agent: CherryPickerSE/1.0
    User-agent: ChinaClaw
    User-agent: Copernic
    User-agent: CopyRightCheck
    User-agent: Crescent
    User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
    User-agent: Custo
    User-agent: DISCo
    User-agent: DISCo Pump 3.0
    User-agent: DISCo Pump 3.2
    User-agent: DISCoFinder
    User-agent: DittoSpyder
    User-agent: Download Demon
    User-agent: Download Demon/3.2.0.8
    User-agent: Download Demon/3.5.0.11
    User-agent: EirGrabber
    User-agent: EmailCollector
    User-agent: EmailSiphon
    User-agent: EmailWolf
    User-agent: EroCrawler
    User-agent: Express WebPictures
    User-agent: Express WebPictures (www.express-soft.com)
    User-agent: ExtractorPro
    User-agent: EyeNetIE
    User-agent: FairAd Client
    User-agent: Flaming AttackBot
    User-agent: FlashGet
    User-agent: FlashGet WebWasher 3.2
    User-agent: Foobot
    User-agent: FrontPage
    User-agent: FrontPage [NC,OR]
    User-agent: Gaisbot
    User-agent: GetRight
    User-agent: GetRight/2.11
    User-agent: GetRight/3.1
    User-agent: GetRight/3.2
    User-agent: GetRight/3.3
    User-agent: GetRight/3.3.3
    User-agent: GetRight/3.3.4
    User-agent: GetRight/4.0.0
    User-agent: GetRight/4.1.0
    User-agent: GetRight/4.1.1
    User-agent: GetRight/4.1.2
    User-agent: GetRight/4.2
    User-agent: GetRight/4.2b (Portuguxeas)
    User-agent: GetRight/4.2c
    User-agent: GetRight/4.3
    User-agent: GetRight/4.5
    User-agent: GetRight/4.5a
    User-agent: GetRight/4.5b
    User-agent: GetRight/4.5b1
    User-agent: GetRight/4.5b2
    User-agent: GetRight/4.5b3
    User-agent: GetRight/4.5b6
    User-agent: GetRight/4.5b7
    User-agent: GetRight/4.5c
    User-agent: GetRight/4.5d
    User-agent: GetRight/4.5e
    User-agent: GetRight/5.0beta1
    User-agent: GetRight/5.0beta2
    User-agent: GetWeb!
    User-agent: Go!Zilla
    User-agent: Go!Zilla (www.gozilla.com)
    User-agent: Go!Zilla 3.3 (www.gozilla.com)
    User-agent: Go!Zilla 3.5 (www.gozilla.com)
    User-agent: Go-Ahead-Got-It
    User-agent: Googlebot
    User-agent: Googlebot-Image
    User-agent: GrabNet
    User-agent: Grafula
    User-agent: HMView
    User-agent: HTTrack
    User-agent: HTTrack 3.0
    User-agent: HTTrack [NC,OR]
    User-agent: Harvest
    User-agent: Harvest/1.5
    User-agent: Image Stripper
    User-agent: Image Sucker
    User-agent: Indy Library
    User-agent: Indy Library [NC,OR]
    User-agent: InfoNaviRobot
    User-agent: InterGET
    User-agent: Internet Ninja
    User-agent: Internet Ninja 4.0
    User-agent: Internet Ninja 5.0
    User-agent: Internet Ninja 6.0
    User-agent: Iron33/1.0.2
    User-agent: JOC Web Spider
    User-agent: JennyBot
    User-agent: JetCar
    User-agent: Kenjin Spider
    User-agent: Kenjin.Spider
    User-agent: Keyword Density/0.9
    User-agent: Keyword.Density
    User-agent: LNSpiderguy
    User-agent: LeechFTP
    User-agent: LexiBot
    User-agent: LinkScan/8.1a Unix
    User-agent: LinkScan/8.1a.Unix
    User-agent: LinkWalker
    User-agent: LinkextractorPro
    User-agent: MIDown tool
    User-agent: MIIxpc
    User-agent: MIIxpc/4.2
    User-agent: MSIECrawler
    User-agent: Mass Downloader
    User-agent: Mass Downloader/2.2
    User-agent: Mata Hari
    User-agent: Mata.Hari
    User-agent: Microsoft URL Control
    User-agent: Microsoft URL Control - 5.01.4511
    User-agent: Microsoft URL Control - 6.00.8169
    User-agent: Microsoft.URL
    User-agent: Mister PiX
    User-agent: Mister PiX version.dll
    User-agent: Mister Pix II 2.01
    User-agent: Mister Pix II 2.02a
    User-agent: Mister.PiX
    User-agent: NICErsPRO
    User-agent: NPBot
    User-agent: NPbot
    User-agent: Navroad
    User-agent: NearSite
    User-agent: Net Vampire
    User-agent: Net Vampire/3.0
    User-agent: NetAnts
    User-agent: NetAnts/1.10
    User-agent: NetAnts/1.23
    User-agent: NetAnts/1.24
    User-agent: NetAnts/1.25
    User-agent: NetMechanic
    User-agent: NetSpider
    User-agent: NetZIP
    User-agent: NetZip Downloader 1.0 Win32(Nov 12 1998)
    User-agent: NetZip-Downloader/1.0.62 (Win32; Dec 7 1998)
    User-agent: NetZippy+(http://www.innerprise.net/usp-spider.asp)
    User-agent: Octopus
    User-agent: Offline Explorer
    User-agent: Offline Explorer/1.2
    User-agent: Offline Explorer/1.4
    User-agent: Offline Explorer/1.6
    User-agent: Offline Explorer/1.7
    User-agent: Offline Explorer/1.9
    User-agent: Offline Explorer/2.0
    User-agent: Offline Explorer/2.1
    User-agent: Offline Explorer/2.3
    User-agent: Offline Explorer/2.4
    User-agent: Offline Explorer/2.5
    User-agent: Offline Navigator
    User-agent: Offline.Explorer
    User-agent: Openbot
    User-agent: Openfind
    User-agent: Openfind data gatherer
    User-agent: Oracle Ultra Search
    User-agent: PageGrabber
    User-agent: Papa Foto
    User-agent: PerMan
    User-agent: ProPowerBot/2.14
    User-agent: ProWebWalker
    User-agent: Python-urllib
    User-agent: QueryN Metasearch
    User-agent: QueryN.Metasearch
    User-agent: RMA
    User-agent: Radiation Retriever 1.1
    User-agent: ReGet
    User-agent: RealDownload
    User-agent: RealDownload/4.0.0.40
    User-agent: RealDownload/4.0.0.41
    User-agent: RealDownload/4.0.0.42
    User-agent: RepoMonkey
    User-agent: RepoMonkey Bait & Tackle/v1.01
    User-agent: SiteSnagger
    User-agent: SlySearch
    User-agent: SmartDownload
    User-agent: SmartDownload/1.2.76 (Win32; Apr 1 1999)
    User-agent: SmartDownload/1.2.77 (Win32; Aug 17 1999)
    User-agent: SmartDownload/1.2.77 (Win32; Feb 1 2000)
    User-agent: SmartDownload/1.2.77 (Win32; Jun 19 2001)
    User-agent: SpankBot
    User-agent: Sqworm/2.9.85-BETA (beta_release; 20011115-775; i686-pc-linux
    User-agent: SuperBot
    User-agent: SuperBot/3.0 (Win32)
    User-agent: SuperBot/3.1 (Win32)
    User-agent: SuperHTTP
    User-agent: SuperHTTP/1.0
    User-agent: Surfbot
    User-agent: Szukacz/1.4
    User-agent: Teleport
    User-agent: Teleport Pro
    User-agent: Teleport Pro/1.29
    User-agent: Teleport Pro/1.29.1590
    User-agent: Teleport Pro/1.29.1634
    User-agent: Teleport Pro/1.29.1718
    User-agent: Teleport Pro/1.29.1820
    User-agent: Teleport Pro/1.29.1847
    User-agent: TeleportPro
    User-agent: Telesoft
    User-agent: The Intraformant
    User-agent: The.Intraformant
    User-agent: TheNomad
    User-agent: TightTwatBot
    User-agent: Titan
    User-agent: True_Robot
    User-agent: True_Robot/1.0
    User-agent: TurnitinBot
    User-agent: TurnitinBot/1.5
    User-agent: URL Control
    User-agent: URL_Spider_Pro
    User-agent: URLy Warning
    User-agent: URLy.Warning
    User-agent: VCI
    User-agent: VCI WebViewer VCI WebViewer Win32
    User-agent: VoidEYE
    User-agent: WWW-Collector-E
    User-agent: WWWOFFLE
    User-agent: Web Image Collector
    User-agent: Web Sucker
    User-agent: Web.Image.Collector
    User-agent: WebAuto
    User-agent: WebAuto/3.40 (Win98; I)
    User-agent: WebBandit
    User-agent: WebBandit/3.50
    User-agent: WebCapture 2.0
    User-agent: WebCopier
    User-agent: WebCopier v.2.2
    User-agent: WebCopier v2.5
    User-agent: WebCopier v2.6
    User-agent: WebCopier v2.7a
    User-agent: WebCopier v2.8
    User-agent: WebCopier v3.0
    User-agent: WebCopier v3.0.1
    User-agent: WebCopier v3.2
    User-agent: WebCopier v3.2a
    User-agent: WebEMailExtrac.*
    User-agent: WebEnhancer
    User-agent: WebFetch
    User-agent: WebGo IS
    User-agent: WebLeacher
    User-agent: WebReaper
    User-agent: WebReaper [info@webreaper.net]
    User-agent: WebReaper [webreaper@otway.com]
    User-agent: WebReaper v9.1 - www.otway.com/webreaper
    User-agent: WebReaper v9.7 - www.webreaper.net
    User-agent: WebReaper v9.8 - www.webreaper.net
    User-agent: WebReaper vWebReaper v7.3 - www.otway.com/webreaper
    User-agent: WebSauger
    User-agent: WebSauger 1.20b
    User-agent: WebSauger 1.20j
    User-agent: WebSauger 1.20k
    User-agent: WebStripper
    User-agent: WebStripper/2.03
    User-agent: WebStripper/2.10
    User-agent: WebStripper/2.12
    User-agent: WebStripper/2.13
    User-agent: WebStripper/2.15
    User-agent: WebStripper/2.16
    User-agent: WebStripper/2.19
    User-agent: WebWhacker
    User-agent: WebZIP
    User-agent: WebZIP/2.75 (http://www.spidersoft.com)
    User-agent: WebZIP/3.65 (http://www.spidersoft.com)
    User-agent: WebZIP/3.80 (http://www.spidersoft.com)
    User-agent: WebZIP/4.0 (http://www.spidersoft.com)
    User-agent: WebZIP/4.1 (http://www.spidersoft.com)
    User-agent: WebZIP/4.21
    User-agent: WebZIP/4.21 (http://www.spidersoft.com)
    User-agent: WebZIP/5.0
    User-agent: WebZIP/5.0 (http://www.spidersoft.com)
    User-agent: WebZIP/5.0 PR1 (http://www.spidersoft.com)
    User-agent: WebZip
    User-agent: WebZip/4.0
    User-agent: WebmasterWorldForumBot
    User-agent: Website Quester
    User-agent: Website Quester - www.asona.org
    User-agent: Website Quester - www.esalesbiz.com/extra/
    User-agent: Website eXtractor
    User-agent: Website eXtractor (http://www.asona.org)
    User-agent: Website.Quester
    User-agent: Webster Pro
    User-agent: Webster.Pro
    User-agent: Wget
    User-agent: Wget/1.5.2
    User-agent: Wget/1.5.3
    User-agent: Wget/1.6
    User-agent: Wget/1.7
    User-agent: Wget/1.8
    User-agent: Wget/1.8.1
    User-agent: Wget/1.8.1+cvs
    User-agent: Wget/1.8.2
    User-agent: Wget/1.9-beta
    User-agent: Widow
    User-agent: Xaldon WebSpider
    User-agent: Xaldon WebSpider 2.5.b3
    User-agent: Xenu's
    User-agent: Xenu's Link Sleuth 1.1c
     
    gatordun, Jul 28, 2005 IP
  9. gatordun

    gatordun Guest

    Messages:
    114
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #9
    here is the rest of it:

    User-agent: Zeus
    User-agent: Zeus 11389 Webster Pro V2.9 Win32
    User-agent: Zeus 11652 Webster Pro V2.9 Win32
    User-agent: Zeus 18018 Webster Pro V2.9 Win32
    User-agent: Zeus 26378 Webster Pro V2.9 Win32
    User-agent: Zeus 30747 Webster Pro V2.9 Win32
    User-agent: Zeus 32297 Webster Pro V2.9 Win32
    User-agent: Zeus 39206 Webster Pro V2.9 Win32
    User-agent: Zeus 41641 Webster Pro V2.9 Win32
    User-agent: Zeus 44238 Webster Pro V2.9 Win32
    User-agent: Zeus 51070 Webster Pro V2.9 Win32
    User-agent: Zeus 51674 Webster Pro V2.9 Win32
    User-agent: Zeus 51837 Webster Pro V2.9 Win32
    User-agent: Zeus 63567 Webster Pro V2.9 Win32
    User-agent: Zeus 6694 Webster Pro V2.9 Win32
    User-agent: Zeus 71129 Webster Pro V2.9 Win32
    User-agent: Zeus 82016 Webster Pro V2.9 Win32
    User-agent: Zeus 82900 Webster Pro V2.9 Win32
    User-agent: Zeus 84842 Webster Pro V2.9 Win32
    User-agent: Zeus 90872 Webster Pro V2.9 Win32
    User-agent: Zeus 94934 Webster Pro V2.9 Win32
    User-agent: Zeus 95245 Webster Pro V2.9 Win32
    User-agent: Zeus 95351 Webster Pro V2.9 Win32
    User-agent: Zeus 97371 Webster Pro V2.9 Win32
    User-agent: Zeus Link Scout
    User-agent: asterias
    User-agent: b2w/0.1
    User-agent: cosmos
    User-agent: eCatch
    User-agent: eCatch/3.0
    User-agent: hloader
    User-agent: httplib
    User-agent: humanlinks
    User-agent: larbin
    User-agent: larbin (samualt9@bigfoot.com)
    User-agent: larbin_2.6.2
    User-agent: larbin_2.6.2 (kabura@sushi.com)
    User-agent: larbin_2.6.2 (larbin2.6.2@unspecified.mail)
    User-agent: larbin_2.6.2 (listonATccDOTgatechDOTedu)
    User-agent: larbin_2.6.2 (vitalbox1@hotmail.com)
    User-agent: larbin_2.6.2 listonATccDOTgatechDOTedu
    User-agent: libWeb/clsHTTP
    User-agent: lwp-trivial
    User-agent: lwp-trivial/1.34
    User-agent: moget
    User-agent: moget/2.1
    User-agent: pavuk
    User-agent: pcBrowser
    User-agent: psbot
    User-agent: searchpreview
    User-agent: spanner
    User-agent: suzuran
    User-agent: tAkeOut
    User-agent: toCrawl/UrlDispatcher
    User-agent: turingos
    User-agent: webfetch/2.1.0
    User-agent: wget
    Disallow: /
    # The original robots.txt standard has no wildcards; the patterns below
    # use the nonstandard extension (honored by some major crawlers) where
    # /*.gif$ matches any URL ending in .gif
    Disallow: /*.gif$
    Disallow: /*.jpg$
    Disallow: /*.jpeg$
    Disallow: /*.png$
    Disallow: /addanyfileoranydirectoryhere
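    A note for anyone copying this: consecutive User-agent lines form a single group, and the Disallow lines that follow apply to every agent in that group -- which is why one block of rules at the end covers the whole list. A compressed sketch of the same structure (the agent names are taken from the list above, and /cgi-bin/ is just a placeholder path):

    # One group: both agents get the same rules
    User-agent: EmailCollector
    User-agent: EmailSiphon
    Disallow: /

    # A separate group for everyone else
    User-agent: *
    Disallow: /cgi-bin/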
     
    gatordun, Jul 28, 2005 IP
  10. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #10
    Geeze, gatordun... that's WAY overkill, IMO.

    I've seen such lists before. Prior to uploading that robots.txt file, how many of those bots had actually visited your site?

    In particular, all those GetRight and FlashGet references are to download accelerators -- why are you so worried about them? My advice is to keep your robots.txt file as simple as possible. If you do find a rogue bot eating bandwidth, ban it. But you just don't need these huge robots.txt files, IMO.
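    To make that concrete, here's a minimal sketch of the approach: ban just the one offender you've actually seen in your logs (WebStripper below is only a stand-in for whatever bot that turns out to be) and leave everyone else alone:

    # Ban the one bandwidth hog observed in the access logs (example name)
    User-agent: WebStripper
    Disallow: /

    # Everyone else is welcome everywhere
    User-agent: *
    Disallow: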
     
    minstrel, Jul 28, 2005 IP
  11. Crazy_Rob

    Crazy_Rob I seen't it!

    Messages:
    13,157
    Likes Received:
    1,366
    Best Answers:
    0
    Trophy Points:
    360
    #11
    Especially since most of those bots don't obey robots.txt anyway.
     
    Crazy_Rob, Jul 28, 2005 IP
  12. gatordun

    gatordun Guest

    Messages:
    114
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #12
    It's a friend's list.
    He guards against everything and needs to.
    That is why we are looking into excluding them in .htaccess instead.
    Look under the Apache htaccess area.
    The one catch is that htaccess usually has to be disabled to load FrontPage webs.
    Still looking for a tweak for that.
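    For the curious, a minimal sketch of the .htaccess approach (Apache with mod_setenvif; the two agent names are just examples from the list earlier in the thread). Unlike robots.txt, which a rogue bot can simply ignore, this is enforced by the server:

    # Tag any request whose User-Agent matches a known offender
    SetEnvIfNoCase User-Agent "WebStripper" bad_bot
    SetEnvIfNoCase User-Agent "EmailCollector" bad_bot

    # Refuse everything tagged above, allow the rest
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot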
     
    gatordun, Jul 28, 2005 IP
  13. gatordun

    gatordun Guest

    Messages:
    114
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #13
    gatordun, Jul 28, 2005 IP
  14. Up2U - YourWorld

    Up2U - YourWorld Peon

    Messages:
    47
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #14
    OMFG!
    How many robots have you submitted to?

    Personally I would put a Disallow on any non-page directory for all bots, but let Google AdSense go anywhere it likes -- more content, more ads!
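    The AdSense content crawler identifies itself as Mediapartners-Google, so that idea might look roughly like this (/images/ and /cgi-bin/ are placeholders for whatever your non-page directories actually are):

    # Let the AdSense crawler fetch everything
    User-agent: Mediapartners-Google
    Disallow:

    # All other bots: stay out of the non-page directories
    User-agent: *
    Disallow: /images/
    Disallow: /cgi-bin/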
     
    Up2U - YourWorld, Jul 28, 2005 IP
  15. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #15
    No, that's incorrect.

    When the FPSE (FrontPage Server Extensions) are installed, FP installs its own htaccess file (and hides it). Use a third-party FTP program to unhide the file on your server if necessary and copy it back to your hard drive.

    Then edit it with Notepad and -- THIS IS IMPORTANT -- add any additional htaccess lines you wish TO THE BOTTOM OF THE ORIGINAL htaccess file.

    Then upload the appended/amended file back to your server.
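    In other words, the finished file keeps FrontPage's generated section first and your own rules last -- something like this sketch (the FrontPage section is abbreviated here, and the ban rule is just an example):

    # -FrontPage-
    # ... the directives FrontPage generated stay here, untouched ...

    # ---- custom additions appended below the original content ----
    SetEnvIfNoCase User-Agent "WebStripper" bad_bot
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot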
     
    minstrel, Jul 28, 2005 IP
  16. gatordun

    gatordun Guest

    Messages:
    114
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #16
    minstrel, we use FlashFTP for the htaccess file.
    Right now we edit it live.
    But we have an error when we try to publish to the site. We are working on that, so we disable the full htaccess file and put up the default htaccess file to publish or tweak the site, then put back the full htaccess file after we are done editing or publishing.
    It's a temp solution.
    So it's better to leave images and the site locked down until we figure out where the htaccess problem is.
     
    gatordun, Jul 29, 2005 IP
  17. minstrel

    minstrel Illustrious Member

    Messages:
    15,082
    Likes Received:
    1,243
    Best Answers:
    0
    Trophy Points:
    480
    #17
    Re-read my post.

    Either you overwrote the original FP htaccess file, or you've messed up the htaccess file in some other way.
     
    minstrel, Jul 29, 2005 IP
  18. gatordun

    gatordun Guest

    Messages:
    114
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    0
    #18
    We know, and we are still looking.
    But it works for now: it bans people / countries / bandwidth thieves, and it stops the indexing of our images and their use on other sites.
    That is the main directive for now!
    Everything always needs a tweak.
     
    gatordun, Jul 29, 2005 IP