Okay, here's the deal - I need to keep all HTML pages from being spidered. There are 700+ duplicate pages with only an agent's name and number being different. They are in different folders, so I can't just block the folders. In particular I am concerned about how Google will follow this, since it is Google we are trying to keep from penalizing us for duplicate content. Can I just add this to the robots.txt ...

    User-agent: *
    Disallow: *.htm
    Disallow: *.html

or should it be

    User-agent: *
    Disallow: *.htm$
    Disallow: *.html$

or

    User-agent: *
    Disallow: /*.htm
    Disallow: /*.html

Will that work..?
Anyone? Anyone...? I have to get back to someone on this ASAP - meeting coming up. I feel like a dork; I don't even know the proper syntax for a robots.txt file, PLUS I can't find it online anywhere... at least not at the 'file types' level. Arrgggh.
Under the original robots.txt standard, rules cannot contain any wildcards. The * in the User-agent line only refers to any robot. It is not a wildcard. The Disallow rule matches the beginning of any path, so

    Disallow: /s

will disallow

    myhost.com/s.html
    myhost.com/s/
    myhost.com/somedir
    myhost.com/somepage.php

Standard robots.txt cannot do what you want to do.
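If you want to check the prefix matching for yourself, here is a rough sketch using Python's standard urllib.robotparser module, which as far as I know does plain prefix matching on paths. The myhost.com URLs are the ones from the example; index.html is an extra URL added here just to show a path that is not blocked. Individual crawlers may behave differently, so treat this as an illustration and not as proof of what any particular spider does.

```python
from urllib.robotparser import RobotFileParser

# The rule from the example: one Disallow line with a bare path prefix.
RULES = """\
User-agent: *
Disallow: /s
"""


def blocked_urls(rules, urls):
    """Return the URLs that a generic robot ("*") may not fetch."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [url for url in urls if not parser.can_fetch("*", url)]


URLS = [
    "http://myhost.com/s.html",
    "http://myhost.com/s/",
    "http://myhost.com/somedir",
    "http://myhost.com/somepage.php",
    "http://myhost.com/index.html",  # added here: path does not start with /s
]

BLOCKED = blocked_urls(RULES, URLS)
for url in URLS:
    print(url, "blocked" if url in BLOCKED else "allowed")
```

Every path that starts with /s gets caught, whether it is a file, a folder, or a script, which is why a rule like this is too blunt for picking out only .htm and .html files.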
Thanks for the responses, guys! I just found that these pages are generated dynamically using a unique URL string ....what.... I'm getting lost on this - how do you create an HTML page dynamically... from a URL....? Some sort of HTML template with <includes> filled by info pulled from a database through the variable in the string....? ME <------ Hopelessly lost at this point... I THINK I can edit the template that these pages are generated through to include that Robots Meta..... I THINK.... we'll see.
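For what it's worth, the guess about a template filled in from a database is roughly how these setups tend to work. Below is a minimal sketch in Python of the general idea, not of the actual system behind these pages: the `agent` query parameter, the `AGENTS` lookup table (standing in for the database), the example.com URL, and the names and numbers are all made up for illustration. The one real piece is the robots meta tag in the template head, which asks spiders not to index the page.

```python
from string import Template
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for the database of agents.
AGENTS = {
    "17": {"name": "Jane Doe", "phone": "555-0100"},
    "18": {"name": "John Roe", "phone": "555-0101"},
}

# One shared template; only the agent's name and number change per page.
# The robots meta tag lives in the template, so every generated page gets it.
PAGE = Template("""<html>
<head>
<title>$name</title>
<meta name="robots" content="noindex">
</head>
<body>
<p>Contact $name at $phone.</p>
</body>
</html>""")


def render_page(url):
    """Build the HTML for the agent named in the URL's query string."""
    query = parse_qs(urlparse(url).query)
    agent_id = query.get("agent", [""])[0]
    agent = AGENTS.get(agent_id)
    if agent is None:
        return None  # unknown agent: a real site would return a 404 page
    return PAGE.substitute(name=agent["name"], phone=agent["phone"])


HTML = render_page("http://example.com/page.html?agent=17")
print(HTML)
```

Because all 700+ pages come out of the same template, adding the meta tag once in the template should cover every one of them, assuming the real template can be edited the same way.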