Is robots.txt observed?

Discussion in 'Co-op Advertising Network' started by Owlcroft, Dec 26, 2004.

  1. #1
    I just got an email stating that one of my sites was being dropped because no co-op ad could be found on the following page:


    But my robots.txt file contains the lines:

    User-agent: *
    Disallow: /abe.php

    The cited URL is not a "page" of that site at all: it is just a forwarder script, linking to a page of the ABE site via Commission Junction. (And it is not in Google's archive.)

    What can I do to deal with this situation?
     
    Owlcroft, Dec 26, 2004 IP
  2. digitalpoint

    digitalpoint Overlord of no one Staff

    Messages:
    38,334
    Likes Received:
    2,613
    Best Answers:
    462
    Trophy Points:
    710
    Digital Goods:
    29
    #2
    digitalpoint, Dec 26, 2004 IP
  3. Owlcroft

    Owlcroft Peon

    Messages:
    645
    Likes Received:
    34
    Best Answers:
    0
    Trophy Points:
    0
    #3
    It's based on what Google knows about:
    http://www.google.com/search?hl=en&...table-site.com+


    But, if I understand the purpose of robots.txt aright, how did Google come to know of those pages? (The file has been unchanged since that pass-through script went up.)

    Or, more on point, what should I do here? I am being penalized for not running ads on pages that are not pages of my site. Most of my sites have similar pass-through php scripts used on a fair percentage of pages: are all such sites impossible to keep in the co-op network?
     
    Owlcroft, Dec 26, 2004 IP
  4. digitalpoint

    #4
    Not sure exactly... but if Google doesn't adhere to your robots.txt file, you might want to shoot them an email.
     
    digitalpoint, Dec 26, 2004 IP
  5. Owlcroft

    #5
    Well, I did a lot of homework on robots.txt files and discovered that there is contradictory advice out there, so it may be that my robots.txt file (there and on other sites) was defective. I have made what I hope are valid corrections, and asked Google for a forced (immediate) robots.txt-based exclusion update.

    I wonder if this deserves a thread elsewhere, or if I'm the only fool in the world. More than one apparently authoritative source states that to block a particular file, one uses the form:

    Disallow: /filename.ext

    But others say to just use:

    Disallow: filename.ext

    While yet others say to use the form:

    Disallow: /pathlevel1/pathlevel2/filename.ext

    I was using the first (slash-filename.ext), but have switched to the third (/fullpath/filename.ext) and will see what happens with that particular site over the next 24 hours.
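
    For anyone comparing the three forms, here is a minimal sketch of the prefix rule as I understand it from the original robots exclusion standard: a Disallow value blocks any URL path that starts with it. The helper name is_disallowed and the script location are hypothetical, chosen only to mirror the forms quoted above; real crawlers may add wildcard or Allow handling that this sketch ignores.

```python
def is_disallowed(rule_path: str, url_path: str) -> bool:
    """Plain prefix matching: a non-empty rule blocks every path that starts with it."""
    return rule_path != "" and url_path.startswith(rule_path)


# Hypothetical location of the script, mirroring the third form in the post.
script_path = "/pathlevel1/pathlevel2/filename.ext"

forms = [
    "/filename.ext",                        # first form: matches only at the site root
    "filename.ext",                         # second form: no leading slash, never a prefix of a URL path
    "/pathlevel1/pathlevel2/filename.ext",  # third form: full path from the root
]

# Map each Disallow value to whether it blocks the script's actual path.
results = {form: is_disallowed(form, script_path) for form in forms}

for form, blocked in results.items():
    print(f"Disallow: {form:<40} blocks {script_path}? {blocked}")
```

    Under that assumption only the third form blocks a script that lives two directories down; the first form would work only if the file sat in the root.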
     
    Owlcroft, Dec 26, 2004 IP
  6. Cardplayer

    Cardplayer Peon

    Messages:
    53
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    0
    #6
    I'm having the same problem.

    This is my robots.txt file
    
    User-agent: Titan 
    Disallow: / 
    
    User-agent: EmailCollector 
    Disallow: / 
    
    User-agent: EmailSiphon 
    Disallow: / 
    
    User-agent: EmailWolf 
    Disallow: / 
    
    User-agent: ExtractorPro 
    Disallow: / 
    
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /search.asp
    Disallow: /search.php
    Disallow: /jump.php
    Disallow: /contact.php
    
    jump.php is my forwarder script, but it is indexed by Google many times over and causes my co-op ads to be rejected once in a while. It's located in the root, so /jump.php would be the full path. Not sure what else to do for it.
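
    As a sanity check on the syntax, the file above can be fed to Python's standard urllib.robotparser. This is only a sketch: the host example.com, the query string ?id=42, and the "Googlebot" agent name are assumptions for illustration, and it shows how one standards-following parser reads the rules, not how any search engine decides what to list.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt as posted (trailing spaces dropped).
ROBOTS_TXT = """\
User-agent: Titan
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /search.asp
Disallow: /search.php
Disallow: /jump.php
Disallow: /contact.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "http://example.com"  # hypothetical host, not the poster's site

# (agent, path) pairs to test; the query string is made up for illustration.
checks = [
    ("Googlebot", "/jump.php?id=42"),
    ("Googlebot", "/index.html"),
    ("Titan", "/index.html"),
]

# True means the agent may fetch the URL, False means it is disallowed.
allowed = {(agent, path): parser.can_fetch(agent, BASE + path) for agent, path in checks}

for (agent, path), ok in allowed.items():
    print(f"{agent:<10} {path:<20} allowed={ok}")
```

    With this parser, /jump.php (with or without a query string) comes back as disallowed for a generic crawler, so the rule itself looks well formed when the script really is in the root.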
     
    Cardplayer, Dec 27, 2004 IP
  7. exam

    exam Peon

    Messages:
    2,434
    Likes Received:
    120
    Best Answers:
    0
    Trophy Points:
    0
    #7
    Yup,
    
    User-agent: *
    Disallow: /abe.php
    
    will disallow every URL that begins with
    http://the-vegetable-site.com/abe.php

    Since the script sits under /vegetable-books/ rather than in the root, you need

    
    User-agent: *
    Disallow: /vegetable-books/abe.php
    
    If in doubt, use the site that Google quotes (robotstxt.org) as your reference.
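
    The difference between the two rules can be checked with Python's standard urllib.robotparser. A small sketch, assuming the script lives at /vegetable-books/abe.php on the host quoted above; the helper can_crawl and the "Googlebot" agent string are my own choices for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, agent="Googlebot"):
    """Parse the given robots.txt lines and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Host as quoted in the post; path taken from the corrected rule.
url = "http://the-vegetable-site.com/vegetable-books/abe.php"

root_rule = ["User-agent: *", "Disallow: /abe.php"]
full_rule = ["User-agent: *", "Disallow: /vegetable-books/abe.php"]

root_allows = can_crawl(root_rule, url)  # True: the root-level rule does not cover this path
full_allows = can_crawl(full_rule, url)  # False: the full-path rule blocks it

print("Disallow: /abe.php                 allows crawl:", root_allows)
print("Disallow: /vegetable-books/abe.php allows crawl:", full_allows)
```

    So the original rule left the script crawlable, and only the full-path rule excludes it.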
     
    exam, Dec 27, 2004 IP
  8. Owlcroft

    #8
    Yes, interesting. I took this issue to a new thread elsewhere on "robots.txt". It seems to be an excellent example of hubris on my part, and possibly that of others, to have assumed that robots.txt has an "obvious" syntax, but, in mitigation of my folly, there is an *awful* lot of erroneous "information" out there on the web.
     
    Owlcroft, Dec 28, 2004 IP