Robots.txt - hide sitewide HTML pages only

Discussion in 'robots.txt' started by Epica, Apr 20, 2005.

  1. #1
    Okay, here's the deal:

    I need to keep all HTML pages from being spidered. There are 700+ duplicate pages where only an agent's name and number differ. They are in different folders, so I can't just block the folders.

    In particular, I am concerned about how Google will handle this, since it is Google we are trying to keep from flagging us for duplicate content.

    Can I just add this to the robots.txt ...

    User-agent: *
    Disallow: *.htm
    Disallow: *.html

    Code (markup):

    or should it be

    User-agent: *
    Disallow: *.htm$
    Disallow: *.html$

    Code (markup):

    or

    User-agent: *
    Disallow: /*.htm
    Disallow: /*.html

    Code (markup):

    Will that work...?
     
    Epica, Apr 20, 2005 IP
  2. #2
    Anyone? Anyone...? :(

    I have to get back to someone on this ASAP - meeting coming up.

    I feel like a dork - I don't even know the proper syntax for a robots.txt file :(

    PLUS I can't find it online anywhere... at least not at the 'file types' level.

    Arrgggh.
     
    Epica, Apr 20, 2005 IP
  3. #3
    Standard robots.txt rules cannot contain any wildcards. The * only refers to any robot in the User-agent line; it is not a wildcard.

    The Disallow rule matches the beginning of any path, so

    Disallow: /s

    will disallow
    myhost.com/s.html
    myhost.com/s/
    myhost.com/somedir
    myhost.com/somepage.php

    Plain robots.txt cannot do what you want to do.
     
    exam, Apr 20, 2005 IP
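    One caveat to that: Googlebot in particular recognizes an extended pattern syntax with * and $ in Disallow rules, beyond the plain standard. Since Google is the crawler being targeted here, a record along these lines may work for it - though robots that only follow the original spec will ignore the patterns, so treat this as a Google-only sketch:

    User-agent: Googlebot
    Disallow: /*.htm$
    Disallow: /*.html$

    Code (markup):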
  4. #4
    Just use noindex tags on those pages and you're all set. :)
     
    noppid, Apr 20, 2005 IP
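    For reference, the noindex tag noppid mentions is the robots meta tag, placed in each page's <head>. A minimal example - noindex keeps the page out of the index, while follow still lets the crawler follow its links:

    <meta name="robots" content="noindex, follow">

    Code (markup):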
  5. #5
    Thanks for the responses, guys! :)

    I just found out that these pages are generated dynamically using a unique URL string :( ...what... I'm getting lost on this - how do you create an HTML page dynamically... from a URL...?

    Some sort of HTML template with <includes>, filled with info pulled from a database via the variable in the string...?

    ME <------ Hopelessly lost at this point...

    I THINK I can edit the template these pages are generated from to include that robots meta tag... I THINK... we'll see.
     
    Epica, Apr 20, 2005 IP
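    Roughly, yes - that is the usual setup: one template is requested with a variable in the URL (for example agent.html?id=123), the server looks that id up in the database, and placeholders in the template are swapped for that agent's details. The sketch below uses hypothetical placeholder names ({AGENT_NAME}, {AGENT_PHONE}) just to show the idea - and it also shows how one noindex tag in the shared template would cover every generated page:

    <html>
    <head>
      <title>Agent profile</title>
      <!-- one tag in the shared template covers all 700+ generated pages -->
      <meta name="robots" content="noindex, follow">
    </head>
    <body>
      <!-- the server substitutes these hypothetical placeholders with
           values pulled from the database using the id in the URL -->
      <p>Call {AGENT_NAME} at {AGENT_PHONE}.</p>
    </body>
    </html>

    Code (markup):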