Opinion Regarding Robots.txt Change

Discussion in 'Google' started by warneylm, Apr 10, 2012.

  1. #1
    Hi All,

    We are taking a bit of a hit on Google due to the amount of duplicate content on our site. Therefore as part of a phased approach we want to start removing non-essential pages from the eyes of Googlebot.

    Basically, our first action is to disallow Googlebot from crawling/indexing any product page that is a 4th-generation copy (or more). We have done some initial research and think that using the * wildcard as shown below should be OK.

    This is the proposed code we are thinking of using:

    User-agent: *
    Disallow: /copy_of_copy_of_copy_of_*.html

    Therefore:

    http://www.kjbeckett.com/acatalog/bl...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/fred-perry_p2.html).
    http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/mens-bags_p4.html).
    http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD be crawled (from http://www.kjbeckett.com/acatalog/mens-bags.html).
    http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD NOT be crawled (from http://www.kjbeckett.com/acatalog/messenger-bags.html).
    http://www.kjbeckett.com/acatalog/co...red-perry.html WOULD NOT be crawled (from http://www.kjbeckett.com/acatalog/fred-perry.html).

    Do you think our usage of the * wildcard is correct? In other words, using the examples above, would we still be crawled where we want to be, and not crawled where we don't?

    Any help would be greatly appreciated.

    Cheers,
    Liam
     
    warneylm, Apr 10, 2012
  2. trosquin

    #2
    Rather than messing with robots.txt, which can have a bad effect on the site, why not add canonical tags to those pages? Or even the noindex tag; that would be much easier.
     
    trosquin, Apr 10, 2012
  3. knysna

    #3
    Hi warneylm. As mentioned above rather use canonical or . Blocking URLs via robots.txt is no guarantee that they won't reappear in the search results. If other sites have linked to those pages, the bot will follow those links and index them again. Also remember that if you block via both robots.txt and the meta tags, the spider may never get to crawl the page and see the noindex meta tag, so the URL may still appear in the search results and come up as a copy. So be careful how you block the spider. Regards, knysna.
     
    knysna, Apr 11, 2012
  4. knysna

    #4
    Sorry, the beginning of the above post never came out. It was meant to read: As mentioned above, rather use canonical or the noindex,nofollow meta tag.
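
On the wildcard question from the first post: as far as Google's documented robots.txt matching goes, a rule is compared against the URL path starting from its first character, `*` matches any run of characters, and a trailing `$` anchors the end of the URL. The Python sketch below is only a rough approximation of that behaviour, not Googlebot's actual code. The names `rule_to_regex` and `is_blocked` and the example paths are made up for illustration (the real product URLs in the first post are truncated, so they are not reproduced here).

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Turn a robots.txt path rule into a regex (approximate Google-style wildcards)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule: str, path: str) -> bool:
    """True if the Disallow rule would match the given URL path."""
    # re.match only matches at the start of the string, which mirrors
    # the "rule is a prefix of the path" behaviour.
    return rule_to_regex(rule).match(path) is not None


# Rule exactly as proposed in the first post.
PROPOSED = "/copy_of_copy_of_copy_of_*.html"
# Hypothetical variant that includes the catalogue directory.
WITH_DIR = "/acatalog/copy_of_copy_of_copy_of_*.html"

# Hypothetical paths, for illustration only.
deep_copy = "/acatalog/copy_of_copy_of_copy_of_example.html"
shallow_copy = "/acatalog/copy_of_copy_of_example.html"

print(is_blocked(PROPOSED, deep_copy))     # False
print(is_blocked(WITH_DIR, deep_copy))     # True
print(is_blocked(WITH_DIR, shallow_copy))  # False
```

Under these assumptions the rule as proposed would not match anything under /acatalog/, because it is anchored at the site root. A rule that includes the /acatalog/ prefix would match pages with three or more copy_of_ prefixes and leave the shallower copies crawlable. It is worth confirming any rule in Google's own robots.txt testing tool before relying on it.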
     
    knysna, Apr 11, 2012 IP