Hi All,

We are taking a bit of a hit on Google due to the amount of duplicate content on our site. As part of a phased approach, we want to start hiding non-essential pages from Googlebot. Our first action is to disallow Googlebot from crawling/indexing any product page that is a 4th-generation copy (or deeper). We have done some initial research and think that using the * wildcard as shown below should be OK.

This is the proposed code we are thinking of using:

User-agent: *
Disallow: /copy_of_copy_of_copy_of_*.html

Therefore:

http://www.kjbeckett.com/acatalog/bl...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/fred-perry_p2.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/mens-bags_p4.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/mens-bags.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD NOT be crawled (from http://www.kjbeckett.com/acatalog/messenger-bags.html).
http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD NOT be crawled (from http://www.kjbeckett.com/acatalog/fred-perry.html).

Do you think our usage of the * wildcard is correct? In other words, using the examples above, would we still be crawled where we want to be, and not crawled where we don't?

Any help would be greatly appreciated.

Cheers, Liam
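One way to sanity-check a rule like this before deploying it is to emulate the wildcard matching locally. The sketch below assumes Google's documented semantics: a rule is matched from the start of the URL path, * matches any run of characters, and a trailing $ anchors the end. The helper names and the example file names are hypothetical, since the real URLs in the post are truncated.

```python
import re

def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex (assumed Google-style semantics)."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape the literal pieces, then join them with '.*' in place of each '*'.
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(rule_path: str, url_path: str) -> bool:
    # re.match anchors at the start of the string, mirroring how a rule
    # is compared against the beginning of the URL path.
    return rule_to_regex(rule_path).match(url_path) is not None

# Hypothetical paths standing in for the truncated URLs in the post.
third_gen = "/acatalog/copy_of_copy_of_example.html"
fourth_gen = "/acatalog/copy_of_copy_of_copy_of_example.html"

proposed = "/copy_of_copy_of_copy_of_*.html"            # rule as posted
with_dir = "/acatalog/copy_of_copy_of_copy_of_*.html"   # rule including the directory

for rule in (proposed, with_dir):
    for path in (third_gen, fourth_gen):
        print(rule, path, is_blocked(rule, path))
```

Under these assumptions, the rule as posted would not match anything under /acatalog/, because the pattern starts at the site root. Including the directory (or starting the pattern with /*) is one way to get the intended behaviour, though it is worth confirming against Google's own tooling before relying on it.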
Rather than mess with robots.txt, which can have a bad effect on the site, why not add canonical tags to those pages? Or even the noindex tag; that would be much easier.
Hi warneylm. As mentioned above rather use canonical or . Blocking URLs via robots.txt is no guarantee that they won't reappear in the search results. If other sites have linked to those pages, the bot can still discover the URLs through those links and index them again. Also remember that if you block a page via robots.txt and also add the meta tags, the spider may never get to crawl the page and see the noindex meta tag, so the URL may still appear in the search results and come up as a copy. So be careful how you block the spider. Regards, knysna.
Sorry, the beginning of the above post never came out. It was meant to read: As mentioned above, rather use canonical or the noindex,nofollow meta tag.
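For anyone going the canonical/noindex route, a quick way to confirm the tags actually made it into a page is to parse the HTML and read them back. This is only a minimal sketch using Python's standard html.parser; the class and function names, the sample markup and the example.com URL are all made up for illustration. Remember that a crawler can only see these tags if the page is not also blocked in robots.txt.

```python
from html.parser import HTMLParser

class IndexingHints(HTMLParser):
    """Collects the robots meta tag and canonical link from a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.robots = None      # content of <meta name="robots" ...>
        self.canonical = None   # href of <link rel="canonical" ...>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.robots = a.get("content")
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")

def indexing_hints(html: str):
    parser = IndexingHints()
    parser.feed(html)
    return parser.robots, parser.canonical

# Made-up sample pages: one copy marked noindex, one pointing at its original.
noindex_page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
canonical_page = (
    '<html><head><link rel="canonical" '
    'href="http://www.example.com/acatalog/original-product.html"></head></html>'
)

print(indexing_hints(noindex_page))
print(indexing_hints(canonical_page))
```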