Robots.txt - hide sitewide HTML pages only

Discussion in 'robots.txt' started by Epica, Apr 20, 2005.

  1. #1
    Okay, here's the deal -

    I need to keep all HTML pages from being spidered. There are 700+ duplicate pages in which only an agent's name and number differ. They are in different folders, so I can't just block the folders.

    In particular, I am concerned about how Google will follow this, since it is Google we are trying to keep from penalizing us for duplicate content.

    Can I just add this to the robots.txt ...

        User-agent: *
        Disallow: *.htm
        Disallow: *.html

    or should it be

        User-agent: *
        Disallow: *.htm$
        Disallow: *.html$

    or

        User-agent: *
        Disallow: /*.htm
        Disallow: /*.html

    Will that work..?
     
    Epica, Apr 20, 2005
  2. Epica

    #2
    Anyone? Anyone...? :(

    I have to get back to someone on this ASAP - meeting coming up.

    I feel like a dork; I don't even know the proper syntax for a robots.txt file :(

    PLUS I can't find it online anywhere... at least not at the 'file types' level.

    arrgggh
     
    Epica, Apr 20, 2005
  3. exam

    #3
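    The original robots.txt standard does not allow wildcards in Disallow rules. The * in the User-agent line only refers to any robot; it is not a path wildcard. Some crawlers, Googlebot among them, accept * and $ in paths as an extension, but other robots are not required to honor it.

    A Disallow rule matches the beginning of any path, so

    Disallow: /s

    will disallow
    myhost.com/s.html
    myhost.com/s/
    myhost.com/somedir
    myhost.com/somepage.php

    Standard robots.txt cannot do what you want to do.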
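
To make the prefix matching concrete, here is a minimal Python sketch of how a Disallow value is compared against a request path under the original standard. The function name `is_disallowed` and the extra `/index.html` path are made up for illustration; real crawlers layer more logic on top (Allow lines, per-robot records, URL decoding).

```python
def is_disallowed(path, disallow_rules):
    """Return True if the path starts with any non-empty Disallow value.

    Hypothetical helper: plain prefix comparison only, no wildcards.
    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/s"]  # the single rule from the example above

# The four paths from the post, plus one that does not start with /s
for path in ["/s.html", "/s/", "/somedir", "/somepage.php", "/index.html"]:
    print(path, is_disallowed(path, rules))
```

Because the comparison is a plain prefix test, a rule written as `*.html` would only match a path that literally begins with the characters `*.html`, which is why it cannot pick out pages by file extension.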
     
    exam, Apr 20, 2005
  4. noppid

    #4
    Just use noindex tags on those pages and you're all set. :)
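
If it helps, here is a rough Python sketch for spot-checking that a page really carries the robots meta tag once it has been added. It uses only the standard library's `html.parser`; the class name `RobotsMetaFinder`, the helper `has_noindex`, and the sample markup (including the `noindex, follow` value) are assumptions for illustration, not something taken from the site in question.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without a value
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def has_noindex(html_text):
    """Return True if the page has a robots meta tag containing noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return "noindex" in finder.directives


# Hypothetical sample page
page = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Agent page</body></html>"
)
print(has_noindex(page))
```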
     
    noppid, Apr 20, 2005
  5. Epica

    #5
    Thanks for the responses, guys! :)

    I just found out that these pages are generated dynamically using a unique URL string :( ... what? I'm getting lost on this. How do you create an HTML page dynamically from a URL?

    Some sort of HTML template with <includes> filled with info pulled from a database through the variable in the string?

    ME <------ Hopelessly lost at this point...

    I THINK I can edit the template these pages are generated from to include that robots meta tag. I THINK. We'll see.
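
For anyone else wondering how that kind of setup usually works, here is a small Python sketch of the general idea: one shared template, an ID taken from the URL string, and a lookup that fills in the agent's details. Everything in it is hypothetical (the `AGENTS` dictionary standing in for the database, the field names, the sample agent), and the real site may well use a different language or template system. The point is that the robots meta tag only has to be added once, in the template, and every generated page picks it up.

```python
from html import escape
from string import Template

# One shared template; the robots meta tag is added here a single time
PAGE_TEMPLATE = Template("""<html>
<head>
  <title>$name</title>
  <meta name="robots" content="noindex">
</head>
<body>
  <h1>$name</h1>
  <p>Phone: $phone</p>
</body>
</html>""")

# Hypothetical stand-in for the database, keyed by the ID in the URL string
AGENTS = {
    "101": {"name": "Jane Doe", "phone": "555-0100"},
}


def render_agent_page(agent_id):
    """Fill the shared template with one agent's details."""
    agent = AGENTS[agent_id]
    return PAGE_TEMPLATE.substitute(
        name=escape(agent["name"]),
        phone=escape(agent["phone"]),
    )


# e.g. a request for a URL ending in ?agent=101
print(render_agent_page("101"))
```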
     
    Epica, Apr 20, 2005