View Full Version : Robots.txt question


mcsp
Jul 26th 2005, 2:40 pm
Hello, I understand what the robots.txt file is for, but I am a bit lost on how to add one. I have a site that gets this message when I run it through a validator:

We're sorry, this robots.txt does NOT validate.
Warnings Detected: 391
Errors Detected: 415

First post here. Thanks for the help

frankm
Jul 26th 2005, 2:43 pm
Hi mcsp -- and welcome to DP!

You create a robots.txt just like any other file (such as your index.html) and upload it to the root directory of your website.

You can check it at http://www.yourwebsitename.com/robots.txt

If you do not want to see that error message, just create an empty file (0 bytes) and upload it as robots.txt.
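
To make the "root dir" part concrete, here is a small Python sketch that works out the robots.txt address for any page on a site. The helper name `robots_url` and the example.com address are made up for illustration; the only point is that the file sits at the root, whatever page you start from.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Hypothetical page deep inside a site; the result still points at the root.
print(robots_url("http://www.example.com/some/dir/page.html"))
```

A robots.txt placed in a subdirectory would not be found at that address, which is why it has to be uploaded to the root.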

mcsp
Jul 26th 2005, 3:08 pm
Hey, thanks for the help! So it is not in the HTML of the index page? It is a .txt file in the root. Much less brain damage than I thought.

Thanks again!

frankm
Jul 26th 2005, 5:15 pm
Yeah, it is just that simple :-)

/robots.txt

minstrel
Jul 26th 2005, 7:47 pm
A basic robots.txt file looks like this:

User-agent: *
Disallow:

This means that for ALL spiders (*) nothing is disallowed, so they may crawl and index everything.
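
If you want to convince yourself of that, the two-line file can be checked with the robots.txt parser in Python's standard library. This is only a sketch: the bot name "AnyBot" and the example.com address are placeholders, and a real crawler may interpret the file slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The basic file from the post: every spider, nothing disallowed.
BASIC = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(BASIC.splitlines())

# "AnyBot" and the URL are placeholders; any name and path give the same answer.
allowed = parser.can_fetch("AnyBot", "http://www.example.com/page.html")
print(allowed)
```

An empty Disallow value is read as "no restriction", so the check reports that the page may be fetched.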

More information

Official robots.txt standards site (http://www.robotstxt.org/wc/norobots.html)

A robots.txt tutorial (http://www.searchengineworld.com/robots/robots_tutorial.htm)

A robots.txt syntax checker (http://www.sxw.org.uk/computing/robots/check.html)

A robots.txt validator (http://www.searchengineworld.com/cgi-bin/robotcheck.cgi)

mcsp
Jul 26th 2005, 8:13 pm
Well I have to say I am impressed with the response. This looks like a great group here. Happy to have stumbled in. Been lurking for a year.

Thanks again

gatordun
Jul 27th 2005, 11:49 am
These are good:

http://www.internet-search-engines-faq.com/bad-robots.shtml

http://www.internet-search-engines-faq.com/robots-txt.shtml

http://tool.motoricerca.info/robots-checker.phtml

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

gatordun
Jul 28th 2005, 9:37 am
Here is an example that we used before:

# Robots.txt file from http://www.website.com
#
# Bans from text, images and graphics = just add a note
#
User-agent: *
User-agent: alexa.com
User-agent: archive.org
User-agent: ia_archiver
User-agent: Alexibot
User-agent: Aqua_Products
User-agent: BackDoorBot
User-agent: BackDoorBot/1.0
User-agent: Black.Hole
User-agent: BlackWidow
User-agent: BlowFish
User-agent: BlowFish/1.0
User-agent: Bookmark search tool
User-agent: Bot mailto:craftbot@yahoo.com
User-agent: BotALot
User-agent: BotRightHere
User-agent: BuiltBotTough
User-agent: Bullseye
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: Cegbfeieh
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: ChinaClaw
User-agent: Copernic
User-agent: CopyRightCheck
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: Custo
User-agent: DISCo
User-agent: DISCo Pump 3.0
User-agent: DISCo Pump 3.2
User-agent: DISCoFinder
User-agent: DittoSpyder
User-agent: Download Demon
User-agent: Download Demon/3.2.0.8
User-agent: Download Demon/3.5.0.11
User-agent: EirGrabber
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: EroCrawler
User-agent: Express WebPictures
User-agent: Express WebPictures (www.express-soft.com)
User-agent: ExtractorPro
User-agent: EyeNetIE
User-agent: FairAd Client
User-agent: Flaming AttackBot
User-agent: FlashGet
User-agent: FlashGet WebWasher 3.2
User-agent: Foobot
User-agent: FrontPage
User-agent: FrontPage [NC,OR]
User-agent: Gaisbot
User-agent: GetRight
User-agent: GetRight/2.11
User-agent: GetRight/3.1
User-agent: GetRight/3.2
User-agent: GetRight/3.3
User-agent: GetRight/3.3.3
User-agent: GetRight/3.3.4
User-agent: GetRight/4.0.0
User-agent: GetRight/4.1.0
User-agent: GetRight/4.1.1
User-agent: GetRight/4.1.2
User-agent: GetRight/4.2
User-agent: GetRight/4.2b (Portuguxeas)
User-agent: GetRight/4.2c
User-agent: GetRight/4.3
User-agent: GetRight/4.5
User-agent: GetRight/4.5a
User-agent: GetRight/4.5b
User-agent: GetRight/4.5b1
User-agent: GetRight/4.5b2
User-agent: GetRight/4.5b3
User-agent: GetRight/4.5b6
User-agent: GetRight/4.5b7
User-agent: GetRight/4.5c
User-agent: GetRight/4.5d
User-agent: GetRight/4.5e
User-agent: GetRight/5.0beta1
User-agent: GetRight/5.0beta2
User-agent: GetWeb!
User-agent: Go!Zilla
User-agent: Go!Zilla (www.gozilla.com)
User-agent: Go!Zilla 3.3 (www.gozilla.com)
User-agent: Go!Zilla 3.5 (www.gozilla.com)
User-agent: Go-Ahead-Got-It
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: GrabNet
User-agent: Grafula
User-agent: HMView
User-agent: HTTrack
User-agent: HTTrack 3.0
User-agent: HTTrack [NC,OR]
User-agent: Harvest
User-agent: Harvest/1.5
User-agent: Image Stripper
User-agent: Image Sucker
User-agent: Indy Library
User-agent: Indy Library [NC,OR]
User-agent: InfoNaviRobot
User-agent: InterGET
User-agent: Internet Ninja
User-agent: Internet Ninja 4.0
User-agent: Internet Ninja 5.0
User-agent: Internet Ninja 6.0
User-agent: Iron33/1.0.2
User-agent: JOC Web Spider
User-agent: JennyBot
User-agent: JetCar
User-agent: Kenjin Spider
User-agent: Kenjin.Spider
User-agent: Keyword Density/0.9
User-agent: Keyword.Density
User-agent: LNSpiderguy
User-agent: LeechFTP
User-agent: LexiBot
User-agent: LinkScan/8.1a Unix
User-agent: LinkScan/8.1a.Unix
User-agent: LinkWalker
User-agent: LinkextractorPro
User-agent: MIDown tool
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: MSIECrawler
User-agent: Mass Downloader
User-agent: Mass Downloader/2.2
User-agent: Mata Hari
User-agent: Mata.Hari
User-agent: Microsoft URL Control
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: Microsoft.URL
User-agent: Mister PiX
User-agent: Mister PiX version.dll
User-agent: Mister Pix II 2.01
User-agent: Mister Pix II 2.02a
User-agent: Mister.PiX
User-agent: NICErsPRO
User-agent: NPBot
User-agent: NPbot
User-agent: Navroad
User-agent: NearSite
User-agent: Net Vampire
User-agent: Net Vampire/3.0
User-agent: NetAnts
User-agent: NetAnts/1.10
User-agent: NetAnts/1.23
User-agent: NetAnts/1.24
User-agent: NetAnts/1.25
User-agent: NetMechanic
User-agent: NetSpider
User-agent: NetZIP
User-agent: NetZip Downloader 1.0 Win32(Nov 12 1998)
User-agent: NetZip-Downloader/1.0.62 (Win32; Dec 7 1998)
User-agent: NetZippy+(http://www.innerprise.net/usp-spider.asp)
User-agent: Octopus
User-agent: Offline Explorer
User-agent: Offline Explorer/1.2
User-agent: Offline Explorer/1.4
User-agent: Offline Explorer/1.6
User-agent: Offline Explorer/1.7
User-agent: Offline Explorer/1.9
User-agent: Offline Explorer/2.0
User-agent: Offline Explorer/2.1
User-agent: Offline Explorer/2.3
User-agent: Offline Explorer/2.4
User-agent: Offline Explorer/2.5
User-agent: Offline Navigator
User-agent: Offline.Explorer
User-agent: Openbot
User-agent: Openfind
User-agent: Openfind data gatherer
User-agent: Oracle Ultra Search
User-agent: PageGrabber
User-agent: Papa Foto
User-agent: PerMan
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: Python-urllib
User-agent: QueryN Metasearch
User-agent: QueryN.Metasearch
User-agent: RMA
User-agent: Radiation Retriever 1.1
User-agent: ReGet
User-agent: RealDownload
User-agent: RealDownload/4.0.0.40
User-agent: RealDownload/4.0.0.41
User-agent: RealDownload/4.0.0.42
User-agent: RepoMonkey
User-agent: RepoMonkey Bait & Tackle/v1.01
User-agent: SiteSnagger
User-agent: SlySearch
User-agent: SmartDownload
User-agent: SmartDownload/1.2.76 (Win32; Apr 1 1999)
User-agent: SmartDownload/1.2.77 (Win32; Aug 17 1999)
User-agent: SmartDownload/1.2.77 (Win32; Feb 1 2000)
User-agent: SmartDownload/1.2.77 (Win32; Jun 19 2001)
User-agent: SpankBot
User-agent: Sqworm/2.9.85-BETA (beta_release; 20011115-775; i686-pc-linux
User-agent: SuperBot
User-agent: SuperBot/3.0 (Win32)
User-agent: SuperBot/3.1 (Win32)
User-agent: SuperHTTP
User-agent: SuperHTTP/1.0
User-agent: Surfbot
User-agent: Szukacz/1.4
User-agent: Teleport
User-agent: Teleport Pro
User-agent: Teleport Pro/1.29
User-agent: Teleport Pro/1.29.1590
User-agent: Teleport Pro/1.29.1634
User-agent: Teleport Pro/1.29.1718
User-agent: Teleport Pro/1.29.1820
User-agent: Teleport Pro/1.29.1847
User-agent: TeleportPro
User-agent: Telesoft
User-agent: The Intraformant
User-agent: The.Intraformant
User-agent: TheNomad
User-agent: TightTwatBot
User-agent: Titan
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: TurnitinBot
User-agent: TurnitinBot/1.5
User-agent: URL Control
User-agent: URL_Spider_Pro
User-agent: URLy Warning
User-agent: URLy.Warning
User-agent: VCI
User-agent: VCI WebViewer VCI WebViewer Win32
User-agent: VoidEYE
User-agent: WWW-Collector-E
User-agent: WWWOFFLE
User-agent: Web Image Collector
User-agent: Web Sucker
User-agent: Web.Image.Collector
User-agent: WebAuto
User-agent: WebAuto/3.40 (Win98; I)
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCapture 2.0
User-agent: WebCopier
User-agent: WebCopier v.2.2
User-agent: WebCopier v2.5
User-agent: WebCopier v2.6
User-agent: WebCopier v2.7a
User-agent: WebCopier v2.8
User-agent: WebCopier v3.0
User-agent: WebCopier v3.0.1
User-agent: WebCopier v3.2
User-agent: WebCopier v3.2a
User-agent: WebEMailExtrac.*
User-agent: WebEnhancer
User-agent: WebFetch
User-agent: WebGo IS
User-agent: WebLeacher
User-agent: WebReaper
User-agent: WebReaper [info@webreaper.net]
User-agent: WebReaper [webreaper@otway.com]
User-agent: WebReaper v9.1 - www.otway.com/webreaper
User-agent: WebReaper v9.7 - www.webreaper.net
User-agent: WebReaper v9.8 - www.webreaper.net
User-agent: WebReaper vWebReaper v7.3 - www,otway.com/webreaper
User-agent: WebSauger
User-agent: WebSauger 1.20b
User-agent: WebSauger 1.20j
User-agent: WebSauger 1.20k
User-agent: WebStripper
User-agent: WebStripper/2.03
User-agent: WebStripper/2.10
User-agent: WebStripper/2.12
User-agent: WebStripper/2.13
User-agent: WebStripper/2.15
User-agent: WebStripper/2.16
User-agent: WebStripper/2.19
User-agent: WebWhacker
User-agent: WebZIP
User-agent: WebZIP/2.75 (http://www.spidersoft.com)
User-agent: WebZIP/3.65 (http://www.spidersoft.com)
User-agent: WebZIP/3.80 (http://www.spidersoft.com)
User-agent: WebZIP/4.0 (http://www.spidersoft.com)
User-agent: WebZIP/4.1 (http://www.spidersoft.com)
User-agent: WebZIP/4.21
User-agent: WebZIP/4.21 (http://www.spidersoft.com)
User-agent: WebZIP/5.0
User-agent: WebZIP/5.0 (http://www.spidersoft.com)
User-agent: WebZIP/5.0 PR1 (http://www.spidersoft.com)
User-agent: WebZip
User-agent: WebZip/4.0
User-agent: WebmasterWorldForumBot
User-agent: Website Quester
User-agent: Website Quester - www.asona.org
User-agent: Website Quester - www.esalesbiz.com/extra/
User-agent: Website eXtractor
User-agent: Website eXtractor (http://www.asona.org)
User-agent: Website.Quester
User-agent: Webster Pro
User-agent: Webster.Pro
User-agent: Wget
User-agent: Wget/1.5.2
User-agent: Wget/1.5.3
User-agent: Wget/1.6
User-agent: Wget/1.7
User-agent: Wget/1.8
User-agent: Wget/1.8.1
User-agent: Wget/1.8.1+cvs
User-agent: Wget/1.8.2
User-agent: Wget/1.9-beta
User-agent: Widow
User-agent: Xaldon WebSpider
User-agent: Xaldon WebSpider 2.5.b3
User-agent: Xenu's
User-agent: Xenu's Link Sleuth 1.1c

gatordun
Jul 28th 2005, 9:37 am
Here is the rest of it:

User-agent: Zeus
User-agent: Zeus 11389 Webster Pro V2.9 Win32
User-agent: Zeus 11652 Webster Pro V2.9 Win32
User-agent: Zeus 18018 Webster Pro V2.9 Win32
User-agent: Zeus 26378 Webster Pro V2.9 Win32
User-agent: Zeus 30747 Webster Pro V2.9 Win32
User-agent: Zeus 32297 Webster Pro V2.9 Win32
User-agent: Zeus 39206 Webster Pro V2.9 Win32
User-agent: Zeus 41641 Webster Pro V2.9 Win32
User-agent: Zeus 44238 Webster Pro V2.9 Win32
User-agent: Zeus 51070 Webster Pro V2.9 Win32
User-agent: Zeus 51674 Webster Pro V2.9 Win32
User-agent: Zeus 51837 Webster Pro V2.9 Win32
User-agent: Zeus 63567 Webster Pro V2.9 Win32
User-agent: Zeus 6694 Webster Pro V2.9 Win32
User-agent: Zeus 71129 Webster Pro V2.9 Win32
User-agent: Zeus 82016 Webster Pro V2.9 Win32
User-agent: Zeus 82900 Webster Pro V2.9 Win32
User-agent: Zeus 84842 Webster Pro V2.9 Win32
User-agent: Zeus 90872 Webster Pro V2.9 Win32
User-agent: Zeus 94934 Webster Pro V2.9 Win32
User-agent: Zeus 95245 Webster Pro V2.9 Win32
User-agent: Zeus 95351 Webster Pro V2.9 Win32
User-agent: Zeus 97371 Webster Pro V2.9 Win32
User-agent: Zeus Link Scout
User-agent: asterias
User-agent: b2w/0.1
User-agent: cosmos
User-agent: eCatch
User-agent: eCatch/3.0
User-agent: hloader
User-agent: httplib
User-agent: humanlinks
User-agent: larbin
User-agent: larbin (samualt9@bigfoot.com)
User-agent: larbin samualt9@bigfoot.com
User-agent: larbin_2.6.2 (kabura@sushi.com)
User-agent: larbin_2.6.2 (larbin2.6.2@unspecified.mail)
User-agent: larbin_2.6.2 (listonATccDOTgatechDOTedu)
User-agent: larbin_2.6.2 (vitalbox1@hotmail.com)
User-agent: larbin_2.6.2 kabura@sushi.com
User-agent: larbin_2.6.2 larbin2.6.2@unspecified.mail
User-agent: larbin_2.6.2 larbin@correa.org
User-agent: larbin_2.6.2 listonATccDOTgatechDOTedu
User-agent: larbin_2.6.2 vitalbox1@hotmail.com
User-agent: libWeb/clsHTTP
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: moget
User-agent: moget/2.1
User-agent: pavuk
User-agent: pcBrowser
User-agent: psbot
User-agent: searchpreview
User-agent: spanner
User-agent: suzuran
User-agent: tAkeOut
User-agent: toCrawl/UrlDispatcher
User-agent: turingos
User-agent: webfetch/2.1.0
User-agent: wget
Disallow: /
Disallow: /.gif$
Disallow: /.jpg$
Disallow: /.jpeg$
Disallow: /.png$
Disallow: /addanyfileoranydirectoryhere
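
One thing worth checking in a list like this is the "User-agent: *" line at the top of the group. The sketch below uses Python's standard-library parser on a three-line excerpt (Wget and WebZIP are taken from the list; "SomeSearchBot" and the example.com address are placeholders). It suggests that, at least for this parser, having "*" in the group makes "Disallow: /" apply to every robot, so the remaining names change nothing. Since everything starts with "/", the later Disallow lines are also redundant under plain prefix matching. Other crawlers may read the file differently, so treat this as an illustration rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

def build(lines):
    """Parse robots.txt lines with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

# Short excerpt of the posted file, with and without the leading "*" line.
with_star = build(["User-agent: *", "User-agent: Wget", "User-agent: WebZIP", "Disallow: /"])
without_star = build(["User-agent: Wget", "User-agent: WebZIP", "Disallow: /"])

url = "http://www.example.com/index.html"  # placeholder address

# "SomeSearchBot" is a made-up robot that is not named in either file.
print(with_star.can_fetch("SomeSearchBot", url))     # blocked: "*" covers everyone
print(without_star.can_fetch("SomeSearchBot", url))  # allowed: only named bots are blocked
print(without_star.can_fetch("Wget", url))           # blocked: named in the group
```

If the goal is to shut out only the listed bots and keep ordinary search engines, dropping the "*" line (and any search engine names) from the group would be the change to try.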

minstrel
Jul 28th 2005, 9:48 am
Geeze, gatordun... that's WAY overkill, IMO.

I've seen such lists before. Prior to uploading that robots.txt file, how many of those had actually visited your site?

In particular, all those GetRight and Go!Zilla references are to download accelerators -- why are you so worried about them?

My advice is to keep your robots.txt file as simple as possible. If you do find a rogue bot eating bandwidth, ban it. But you just don't need these huge robots.txt files, IMO.

Crazy_Rob
Jul 28th 2005, 9:50 am
Especially since most of those bots don't obey robots.txt anyway.

gatordun
Jul 28th 2005, 11:30 am
It's a friend's list.
He guards against everything and needs to.
That is why we are looking into excluding them in .htaccess.
Look under the Apache htaccess area.
The one thing is htaccess usually has to be disabled to load frontpage webs.
Still looking for a tweak for that.

gatordun
Jul 28th 2005, 11:35 am
Here is a list for the htaccess file that we are working on:
http://forums.digitalpoint.com/showthread.php?t=22487

Up2U - YourWorld
Jul 28th 2005, 5:12 pm
OMFG!
How many robots have you submitted to?

Personally, I would disallow every non-page directory for all bots, but let Google AdSense go anywhere it likes: more content, more ads!
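
A rough sketch of that idea, checked with Python's standard-library parser. Mediapartners-Google is the user-agent name of the AdSense crawler; the directory names /cgi-bin/ and /includes/, the bot name "OtherBot", and the example.com addresses are assumptions picked only for the example.

```python
from urllib.robotparser import RobotFileParser

# AdSense crawler: nothing disallowed. Everyone else: kept out of two
# hypothetical non-page directories.
RULES = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /cgi-bin/
Disallow: /includes/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

script = "http://www.example.com/cgi-bin/script.cgi"
page = "http://www.example.com/index.html"

print(parser.can_fetch("Mediapartners-Google", script))  # AdSense may go anywhere
print(parser.can_fetch("OtherBot", script))              # others are kept out of /cgi-bin/
print(parser.can_fetch("OtherBot", page))                # normal pages stay open to all
```

The blank line between the two groups matters: it is what separates the AdSense rules from the rules for everyone else.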

minstrel
Jul 28th 2005, 6:36 pm
"The one thing is htaccess usually has to be disabled to load frontpage webs."
No, that's incorrect.

When the FrontPage Server Extensions (FPSE) are installed, FrontPage installs its own htaccess file (and hides it). Use a 3rd party FTP program to unhide the file on your server if necessary and copy it back to your hard drive.

Then, edit it with Notepad and -- THIS IS IMPORTANT -- add any additional htaccess lines you wish TO THE BOTTOM OF THE ORIGINAL htaccess file.

Then upload the appended/amended file back to your server.
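
For anyone who would rather script the edit than do it by hand, here is a minimal Python sketch of the same "append, never overwrite" step. The helper name `append_rules` is made up, the "original" text is a stand-in for whatever FrontPage actually wrote on your server, and the four Apache lines are only an illustrative user-agent block in Apache 2.2 style; check them against your own server before using anything like this.

```python
from typing import List

def append_rules(original: str, extra_lines: List[str]) -> str:
    """Return the original .htaccess text with extra_lines added at the very end."""
    # Make sure the original ends with a newline so the new lines start cleanly.
    if original and not original.endswith("\n"):
        original += "\n"
    return original + "\n".join(extra_lines) + "\n"

# Stand-in for the file FrontPage created; the real content is left untouched.
original = "# (original FrontPage lines, left exactly as they were)"

# Illustrative additions only: flag one downloader by user agent and deny it.
extra = [
    'SetEnvIfNoCase User-Agent "WebZIP" bad_bot',
    "Order Allow,Deny",
    "Allow from all",
    "Deny from env=bad_bot",
]

result = append_rules(original, extra)
print(result)
```

The original lines come out first and unchanged, with the additions after them, which is the ordering described above.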

gatordun
Jul 29th 2005, 8:56 am
Minstrel, we use FLASHFTP for the htaccess file.
Right now we edit it live.
But we get an error when we try to publish to the site, and we are working on that. In the meantime, we disable the full htaccess file and put up the default htaccess file to publish or tweak the site, then put the full htaccess file back once we are done editing or publishing.
It's a temporary solution.
So it's better to leave the images and the site locked down until we figure out where the htaccess problem is.

minstrel
Jul 29th 2005, 9:13 am
Re-read my post.

Either you overwrote the original FP htaccess file, or you've messed up the htaccess file in some other way.

gatordun
Jul 29th 2005, 11:07 am
We know and we are still looking.
But it works for now: it bans people, countries, and bandwidth thieves, and it stops others from indexing our images or using them on other sites.
That is the main directive for now!
Everything always needs a tweak.