Infant-level robots.txt question.

Discussion in 'robots.txt' started by Owlcroft, Dec 26, 2004.

  1. #1
    I have long assumed that the syntax of a robots.txt file was pretty straightforward, but I now wonder if I'm just stupid, or what.

    The general form of an exclusion is, as I understand it, this--

    User-agent: agentname
    Disallow: filespec
    --where one can use an asterisk * as a wildcard for "all robots". And to disallow multiple specs for the same robot(s), one just tacks on more Disallow: lines (with no skipped lines).

    My question or problem is with the filespec part. I figured, naively, that if I did a little search, I would find the rules laid out, neat, clean, fair, and square. So I did--except that several such layings-out differed on some important details. And the big problem is that this is not like debugging something where you can readily control the environment and see the results, as one can with anything from a script to an .htaccess file: here, there is no simple or immediate way to know the effect of what one does or changes.
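
    (For what it's worth, one way to at least sanity-check a rule before relying on it is to feed trial rules and trial URLs to a parser library and see what it reports -- with the caveat that the library's interpretation may not match every robot's. A quick sketch using Python's standard urllib.robotparser, with made-up rules and paths:)

    from urllib.robotparser import RobotFileParser

    # Made-up rules and paths, purely for experimentation.
    rules = """
    User-agent: *
    Disallow: /hotstuff/
    Disallow: /wowie.html
    """.splitlines()

    parser = RobotFileParser()
    parser.parse(rules)          # parse() accepts a list of lines

    for path in ("/hotstuff/recipe.html", "/wowie.html", "/other/page.html"):
        print(path, parser.can_fetch("*", path))
    # Expected: False, False, True -- i.e. the first two paths are disallowed.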

    Examples: one wants to exclude everything in the directory /hotstuff/--

    OK, easy enough:

    Disallow: /hotstuff/
    Now, one wants to exclude the root-stored file wowie.html; simple enough:

    Disallow: /wowie.html
    But suppose we have this structure:

    /project3/help/forjerks.html
    Will this--

    Disallow: /help/
    --block it? Or must we have the full--

    Disallow: /project3/help/
    --?

    Suppose we had this:

    Disallow: forjerks.html
    Would that exclude that file? Would it exclude all files so named, no matter where stored? Or would we need this--

    Disallow: /project1/help/forjerks.html
    Disallow: /project2/help/forjerks.html
    Disallow: /project3/help/forjerks.html
    --to catch all three instances?

    You can, believe me, get different answers--to the extent that such issues are addressed at all, which is rare, and then usually only by implication.

    For now, having had some problems, I am taking the view that you can't go wrong being fully explicit, and so am excluding particular files by full pathname/filename, even if I need to do that more than once for a site.

    The spec itself--which is sorely due for some updating--would seem to be saying that a full path from the root is needed, with whatever is left off treated as being wildcarded, so that--

    Disallow: /project
    --would, as I understand it (which may be wrong), exclude /project1/ and all contents and subdirectories, as well as /projected_budget/ and all contents and subdirectories, but, I now think, would not exclude /work/project/ or any contents thereof.

    But does anyone have definite knowledge to share?
     
    Owlcroft, Dec 26, 2004
  2. Refrozen

    Refrozen Peon

    #2
    "But does anyone have definite knowledge to share?"

    Well, that doesn't really matter; it depends more on what the bot does and how it is coded. I'd say using a full path is the only definite way; then ban any bots that go somewhere they are banned.
     
    Refrozen, Dec 26, 2004
  3. minstrel

    minstrel Illustrious Member

  4. J.D.

    J.D. Peon

    #4
    I always prefer the original. Here's the official (incomplete) Internet draft:

    http://www.robotstxt.org/wc/norobots-rfc.html

    In brief, the path following Allow: or Disallow: must exactly match the URL in question, including the trailing slash, for as far as the rule path goes. For example:

    Disallow: /my-path

    will match /my-path* and /my-path/*. However,

    Disallow: /my-path/

    will only match /my-path/* (where the asterisk stands for whatever characters follow).

    Keep in mind - unlike domain names, URLs are case-sensitive, regardless of your OS.
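
    Just as an illustration of that prefix rule (my own sketch, not code from any actual robot; the helper name is made up):

    # Hypothetical helper: a Disallow path matches a URL path when the URL
    # starts with it -- compared exactly, case-sensitively.
    def disallowed(url_path, rule_path):
        return url_path.startswith(rule_path)

    print(disallowed("/my-path.html", "/my-path"))        # True
    print(disallowed("/my-path/page.html", "/my-path"))   # True
    print(disallowed("/my-path/page.html", "/my-path/"))  # True
    print(disallowed("/my-path.html", "/my-path/"))       # False - the trailing slash narrows the rule
    print(disallowed("/My-Path/page.html", "/my-path/"))  # False - case matters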

    J.D.
     
    J.D., Dec 27, 2004
  5. Owlcroft

    Owlcroft Peon

    #5
    If one goes, as I did in looking about, to http://www.robotstxt.org/wc/exclusion.html, one is there referred, for the nuts and bolts, to http://www.robotstxt.org/wc/exclusion-admin.html, which has several example lines that--in my opinion--are not very clearly explained as to their precise effect. It also contains the statement that "[t]here is no 'Allow' field"; that is at odds with the (apparently extant) RFC, at http://www.robotstxt.org/wc/norobots-rfc.html, which explicitly refers to such lines (3.2.2, "The Allow and Disallow lines").

    The first-cited page above also refers one to "The original 1994 protocol description, as currently deployed", at http://www.robotstxt.org/wc/norobots.html, which is rather different from the 1996 RFC cited above: for one major thing, it does not recognize the "Allow" statement. And in any event, though I now see that it seems to be correct, it is far from lucid.

    The RFC appears to me the clearest. It includes the sufficiently clear statement that "The match evaluates positively if and only if the end of the path from the record is reached before a difference in octets is encountered."

    This isn't an academic paper, so I won't cite the numerous "informational" pages to be found that get these matters partly or severely wrong, but believe me, they're out there.

    For anyone wanting a simple summary, here is what I now think is correct:

    1. Directives come in blocks, separated by blank lines; do not put a blank line within a block.

    2. Blocks comprise one or more "User-agent" lines, which identify the particular robots to whom the block applies, followed by one or more "Disallow" lines that specify what the designated robot or robots are to avoid. It appears that there is no general robotic implementation of an "Allow" line, but certain robots--notably Google's--are said to understand such lines. Caveat emptor.

    2. a. Robot identification is based on supposedly widely known identifiers for the various robots (there are several tables of such identifiers available on the web).

    2. b. Use one line for each robot: do not conglomerate multiple robots on one line. The form is exact:

    User-agent: identifier
    2. c. An asterisk may be used to signify "all robots"; that is the only sort of "wildcard" allowed on User-agent lines.

    2. d. But letter case does not matter in robot identifiers.


    3. "Disallow" lines also have an exact format:

    Disallow: leadingspec
    3. a. No wildcards of any sort are allowed in "Disallow" lines, except that a blank spec will be taken to signify "nothing whatever"; that is, that all files are allowed to the specified robot or robots.

    3. b. All specs in Disallow lines are considered to start at the root, and must be so entered. A robot checking the availability of a given file to it will match the file's full specification from the root on against all Disallow lines applicable to that robot; if a given Disallow spec matches for the length of the spec as given, it is a match. There is thus a sort of implicit "wildcard" mark after every Disallow spec.

    Example:

    Disallow: /project
    will be taken by robots to match all of these--

    /project1/george/budget2.html
    /projected-figures/2006/quarter1.htm
    /projections.php
    --but it will not match this:

    /work/project1/help.htm

    4. Robots use the file entries sequentially: they "decide" if a filespec is disallowed based on the first "Disallow:" spec match they encounter, even if several such specs would match. This can be significant if there are multiple blocks applying to differing sets of robots--for example, a particular robot may expressly, by robot name, be allowed full access in a block that is later followed by a block addressing "all robots" that disallows various things; the named robot has already matched the spec and will never reach the "all-robots" disallowing block.


    5. Comments can be placed in a robots.txt file. In principle, all text on any line following a hash sign ( # ) , including the sign itself, is disregarded by robots as comment. Several places nonetheless advise that, for maximum safety (robots are not all equally "smart"), meaningful lines not have appended comments (that is, put any comment on a line entirely its own); for the paranoid worry-wart, it might be most comforting to avoid placing comment lines within directive blocks: put them above, below, or between.
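
    To illustrate points 2 through 4, here is a toy evaluator of the first-match reading described above. It is my own sketch, not any robot's actual code (real robots differ in their details), and the function names are made up:

    # Toy sketch of the scheme summarized in points 2-4 above: blocks of
    # User-agent lines plus Disallow lines, the first applicable block wins,
    # and within it the first matching Disallow prefix decides.
    def parse_blocks(text):
        """Split robots.txt text into (agents, disallow_paths) blocks."""
        blocks, agents, paths = [], [], []
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()      # drop comments
            if not line:                              # blank line ends a block
                if agents or paths:
                    blocks.append((agents, paths))
                    agents, paths = [], []
                continue
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                agents.append(value.lower())          # agent names: case-insensitive
            elif field == "disallow":
                paths.append(value)
        if agents or paths:
            blocks.append((agents, paths))
        return blocks

    def allowed(blocks, robot, url_path):
        """The first block naming this robot (or '*') applies; within it,
        the first matching Disallow prefix forbids. A blank Disallow forbids nothing."""
        for agents, paths in blocks:
            if robot.lower() in agents or "*" in agents:
                for p in paths:
                    if p and url_path.startswith(p):
                        return False
                return True        # this block applied; nothing matched
        return True                # no block applies to this robot

    rules = """
    User-agent: FriendlyBot
    Disallow:

    User-agent: *
    Disallow: /project
    """
    blocks = parse_blocks(rules)
    print(allowed(blocks, "FriendlyBot", "/project1/george/budget2.html"))      # True - its own block came first
    print(allowed(blocks, "OtherBot", "/projected-figures/2006/quarter1.htm"))  # False
    print(allowed(blocks, "OtherBot", "/work/project1/help.htm"))               # True - no leading /project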


    Note that there are some directives that are not a formal part of the standard (such as it is) but that are reasonably needed and are starting to be used and recognized by some major robots, even though a robots.txt syntax checker will spit them out. One, already mentioned, is an "Allow:" directive. Another is--

    Crawl-delay: seconds
    (where "seconds" is a number), which specifies the minimum allowed interval between hits by the robot (to avoid certain robots' tendency to bring servers to their knees by rapid-fire hits--M$ is notorious for doing that, and I hear that Yahoo has been known to do also; both, as best I recall--you could look it up--recognize and, supposedly, honor "Crawl-delay", which would be nice.)

    I hope that, one, all that is correct, and two, that it helps someone avoid the follies I went through (especially over 3b).
     
    Owlcroft, Dec 28, 2004
  6. minstrel

    minstrel Illustrious Member

    #6
    That's correct: Yahoo Slurp and MSNBot both honor the Crawl-delay directive. Googlebot is usually better-mannered and doesn't require it.

    That's a great summary, Owlcroft. Shawn should split that off and sticky it somewhere as a reference. Otherwise, I may steal it for another forum :D
     
    minstrel, Dec 28, 2004
  7. crew

    crew Peon

    #7
    Yeah, it appears there are a lot of poor examples out there that use things without explaining them. I am by no means an expert on robots.txt files, but in general with web programming, the leading slash signifies the root (web home) directory, so:

    Disallow: /help
    Code (markup):
    means that 'help' must be in the root directory, while
    Disallow: help
    Code (markup):
    applies to files or directories regardless of location relative to the root directory. I read your post on the Coop board and think this was the missing piece of info that those examples weren't telling you. (So some of those examples were right, but only when the specified file is in the root directory.)
     
    crew, Dec 28, 2004
  8. J.D.

    J.D. Peon

    #8
    Not quite. There seems to be a typo in the robots.txt spec (http://www.robotstxt.org/wc/norobots-rfc.html). For those of you who understand BNF notation, here's the line:

    disallowline = "Disallow" ":" *space path [comment] CRLF

    The typo is that path should be rpath, which is defined later in the document as a path that *always* begins with a slash.

    In other words, Disallow: help is not a valid directive and it will never match any URLs.

    Here's how matching is done - robots compare the URL and the directive paths byte by byte. If all bytes (not characters!) up to the end of the directive path are the same, the URL matches the path.

    J.D.

    PS. I have notified robotstxt.org folks about this and will add to this post if I ever hear anything from them.
     
    J.D., Dec 28, 2004
  9. hdlns

    hdlns Peon

    #9
    I know this is an old message... but...

    Why not let Apache process the robots.txt file (as an adjunct to the .htaccess)?

    That way, if the bot doesn't behave, it can really be cut off.

    I'm sure Apache could very easily add this as a config option.

    Just my $.02.

    John

    HDLNS.com
     
    hdlns, Apr 29, 2005
  10. mcfox

    mcfox Wind Maker

    #10
    Some great advice on robots.txt so far.

    Sometimes the best way to see how something works is to look at a live example, such as the Whitehouse robots.txt.
     
    mcfox, Apr 29, 2005