

View Full Version : Robots.txt - hide sitewide HTML pages only


AzAkers
Apr 20th 2005, 11:27 am
Okay, here's the deal -

I need to keep all HTML pages from being spidered. There are 700+ duplicate pages with only an agent's name and number being different. They are in different folders, so I can't just block the folders.

In particular I am concerned about how Google will follow this, since it is Google we are trying to keep from penalizing us for duplicate content.

Can I just add this to the robots.txt ...

user-agent: *

Disallow: *.htm
Disallow: *.html



or should it be

user-agent: *

Disallow: *.htm$
Disallow: *.html$



or

user-agent: *

Disallow: /*.htm
Disallow: /*.html



Will that work..?

AzAkers
Apr 20th 2005, 11:47 am
anyone anyone ...? :(

I have to get back to someone on this asap - meeting coming up

I feel like a dork, I don't even know the proper syntax for a robots.txt file :(

PLUS I can't find it online anywhere...at least not at the 'file types' level.

arrgggh

exam
Apr 20th 2005, 3:56 pm
Under the original robots.txt standard, Disallow rules cannot contain any wildcards. The * in the User-agent line only refers to any robot. It is not a path wildcard.

The Disallow rule matches the beginning of any path. For example,

Disallow: /s

will disallow
myhost.com/s.html
myhost.com/s/
myhost.com/somedir
myhost.com/somepage.php
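
The prefix matching above can be checked with a minimal sketch using Python's standard-library robots.txt parser. The helper name is_allowed and the extra /index.html path are only illustrative assumptions; the rule and the other paths are the ones from this post.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post: block every path that begins with /s
RULES = [
    "User-agent: *",
    "Disallow: /s",
]

parser = RobotFileParser()
parser.parse(RULES)


def is_allowed(path):
    """Return True if any robot may fetch the given path on myhost.com."""
    return parser.can_fetch("*", "http://myhost.com" + path)


# The four paths from the post, plus one (assumed) that does not start with /s
for path in ["/s.html", "/s/", "/somedir", "/somepage.php", "/index.html"]:
    print(path, "allowed" if is_allowed(path) else "disallowed")
```

Every path that starts with /s is reported as disallowed, whatever its extension, which is why a plain prefix rule cannot single out .html files scattered across folders.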

Standard robots.txt syntax cannot do what you want to do.

noppid
Apr 20th 2005, 3:59 pm
Just use noindex tags on those pages and you're all set. :)

AzAkers
Apr 20th 2005, 4:02 pm
Thanks for the responses guys! :)

I just found that these pages are generated dynamically using a unique url string :(....what.... I'm getting lost on this - how do you create an HTML page dynamically...from a url....?

Some sort of HTML template with <includes> filled by info pulled from a database through the variable in the string....?

ME <------ Hopelessly lost at this point...

I THINK I can edit the template that these pages are generated through to include that Robots Meta.....I THINK....we'll see.
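
For what it's worth, the idea described above (a template filled from a database using a variable in the URL, with the robots meta tag added once in the template) might look roughly like the sketch below. Everything in it is an assumption for illustration: the "agent" query parameter, the AGENTS dictionary standing in for the database, and the page markup are hypothetical, not the actual site's setup.

```python
from html import escape
from string import Template
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for the agent database
AGENTS = {"42": {"name": "Jane Doe", "phone": "555-0100"}}

# One shared template; the robots meta tag is added here once
# and so appears on every generated page
PAGE = Template("""<html>
<head>
<title>$name</title>
<meta name="robots" content="noindex">
</head>
<body>
<p>Contact $name at $phone.</p>
</body>
</html>""")


def render_agent_page(url):
    """Build the HTML for the agent named in the URL's query string."""
    query = parse_qs(urlparse(url).query)
    agent_id = query.get("agent", [""])[0]
    agent = AGENTS.get(agent_id)
    if agent is None:
        return None  # unknown agent, nothing to render
    return PAGE.substitute(
        name=escape(agent["name"]),
        phone=escape(agent["phone"]),
    )


print(render_agent_page("http://myhost.com/page.html?agent=42"))
```

Because all 700+ pages come from the same template, adding the meta tag in that one place covers every one of them.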