Robots.txt - hide sitewide HTML pages only

Epica Well-Known Member

Messages: 1,007

Likes Received: 95

Best Answers: 0

Trophy Points: 170

#1

Okay, here's the deal:

I need to keep all HTML pages from being spidered. There are 700+ duplicate pages with only an agent's name and number being different. They are in different folders, so I can't just block the folders.

In particular, I am concerned about how Google will handle this, since it is Google we are trying to keep from dinging us for duplicate content.

Can I just add this to the robots.txt ...

user-agent: *
Disallow: *.htm
Disallow: *.html

or should it be

user-agent: *
Disallow: *.htm$
Disallow: *.html$

or

user-agent: *
Disallow: /*.htm
Disallow: /*.html

Will that work..?

Epica, Apr 20, 2005

Epica Well-Known Member

#2

anyone anyone ...?

I have to get back to someone on this asap - meeting coming up

I feel like a dork; I don't even know the proper syntax for a robots.txt file

PLUS I can't find it online anywhere...at least not at the 'file types' level.

arrgggh

Epica, Apr 20, 2005

exam Peon

Messages: 2,434

Likes Received: 120

Best Answers: 0

Trophy Points: 0

#3

The original robots.txt standard does not allow wildcards in Disallow rules. The * in the User-agent line only refers to any robot; it is not a path wildcard.

A Disallow rule matches the beginning of any path:

Disallow: /s

will disallow
myhost.com/s.html
myhost.com/s/
myhost.com/somedir
myhost.com/somepage.php
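
The prefix matching described above can be checked with Python's standard-library urllib.robotparser, which applies plain prefix matching to Disallow values. This is only a minimal sketch: the host and the first four paths come from the list above, while /index.html and the is_allowed helper are added for illustration.

```python
from urllib.robotparser import RobotFileParser

# Feed the rule from the post directly to the parser (no network access needed).
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /s"])

def is_allowed(path: str) -> bool:
    """Return True if a generic robot may fetch the given path on myhost.com."""
    return parser.can_fetch("*", "http://myhost.com" + path)

# /index.html is an added example of a path that does not start with /s.
for path in ["/s.html", "/s/", "/somedir", "/somepage.php", "/index.html"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Every path that starts with /s is reported as blocked, whatever its extension, which is why a plain prefix rule cannot single out .html files spread across many folders.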

Under the original standard, robots.txt cannot do what you want to do.
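
Strictly, that limitation belongs to the original standard. Google documents a pattern-matching extension for Googlebot in which * matches any run of characters and a trailing $ anchors the end of the URL. The sketch below is a simplified model of that documented behavior, not Google's code; rule_to_regex and is_blocked are hypothetical helper names, and the sample paths are made up.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard-style Disallow value into a regex (simplified model)."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces, then let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_blocked(rule: str, path: str) -> bool:
    """Rules are matched from the start of the URL path."""
    return rule_to_regex(rule).match(path) is not None

# Hypothetical paths, used only to compare the candidate rules.
for rule in ["/*.htm", "/*.htm$", "/*.html$"]:
    for path in ["/agents/smith/page.html", "/agents/smith/page.htm", "/search.php"]:
        print(rule, path, "blocked" if is_blocked(rule, path) else "allowed")
```

Under this model, /*.htm without the $ also catches .html pages because it matches as a prefix, while the $ forms only match URLs that end exactly in the given extension. Crawlers that follow only the original standard would ignore or misread these patterns.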

exam, Apr 20, 2005

noppid gunnin' for the quota

Messages: 4,246

Likes Received: 232

Best Answers: 0

Trophy Points: 135

#4

Just use noindex robots meta tags (<meta name="robots" content="noindex">) on those pages and you're all set.

noppid, Apr 20, 2005

Epica Well-Known Member

#5

Thanks for the responses guys!

I just found out that these pages are generated dynamically using a unique URL string... what? I'm getting lost on this: how do you create an HTML page dynamically from a URL?

Some sort of HTML template with <includes> filled by info pulled from a database through the variable in the string....?

ME <------ Hopelessly lost at this point...

I THINK I can edit the template that these pages are generated through to include that Robots Meta.....I THINK....we'll see.
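
For what it's worth, the guess above is roughly how such pages usually work: one template, filled in from a lookup keyed on a value in the URL. The sketch below is only an illustration under assumptions; the agent query parameter, the AGENTS dictionary standing in for the database, the sample data, and render_agent_page are all hypothetical and not taken from the actual site.

```python
from html import escape
from string import Template
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for the database of agents.
AGENTS = {"17": {"name": "Jane Doe", "phone": "555-0100"}}

# One shared template; the robots meta tag here ends up on every generated page.
PAGE = Template("""<html>
<head>
<title>$name</title>
<meta name="robots" content="noindex">
</head>
<body><p>Contact $name at $phone.</p></body>
</html>""")

def render_agent_page(url: str) -> str:
    """Build the HTML for one agent from the (assumed) 'agent' query parameter."""
    query = parse_qs(urlparse(url).query)
    agent_id = query.get("agent", [""])[0]
    agent = AGENTS[agent_id]  # raises KeyError for an unknown agent
    return PAGE.substitute(name=escape(agent["name"]), phone=escape(agent["phone"]))

print(render_agent_page("http://example.com/page.html?agent=17"))
```

Because all 700+ pages come from the same template, adding the meta tag once in the template head covers every one of them.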

Epica, Apr 20, 2005
