robots.txt and duplicate content in WordPress

Discussion in 'Search Engine Optimization' started by sandrodz, Nov 18, 2007.

  1. #1
    Hi, I found this neat robots.txt file on the WordPress website. What do you think? Is it worth using? Does it look all right to you? Can you please explain the part at the bottom: why would I block AdSense? Digg?

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: /category/*/*
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /*?*
    Disallow: /*?
    Allow: /wp-content/uploads
    
    # Google Image
    User-agent: Googlebot-Image
    Disallow:
    Allow: /*
    
    # Google AdSense
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*
    
    # Internet Archiver Wayback Machine
    User-agent: ia_archiver
    Disallow: /
    
    # digg mirror
    User-agent: duggmirror
    Disallow: /
    
    Sitemap: http://www.sandrophoto.com/sitemap.xml

    sandrodz, Nov 18, 2007
  2. Chewyshoe

    #2
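    It isn't blocking Google Images or AdSense. An empty Disallow: line means nothing is disallowed, and Allow: /* permits everything, so those two sections give Googlebot-Image and Mediapartners-Google (the AdSense crawler) full access, even to paths the User-agent: * section blocks. It isn't blocking Digg either; it's blocking duggmirror, along with ia_archiver (the Wayback Machine crawler). Duggmirror copies your site's content onto a separate server so that a copy stays up when the Digg visitors start coming in and your own server goes down.

    The AdSense section only matters if you run ads. At the moment WordPress-hosted sites don't allow advertising of any form; the rumour is that they are trying to strike a deal with Google AdSense to be their one and only advertiser.

    Some people do block Google Images, because image traffic is largely useless and uses up a lot of bandwidth. To do that you would need Disallow: / under Googlebot-Image instead.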
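
    A quick way to check what each section of the file does is to feed it to a parser. The sketch below uses Python's standard urllib.robotparser on a trimmed copy of the rules. The wildcard lines and the Mediapartners-Google* section are left out because, as far as I know, that parser matches paths by plain prefix and user-agent names literally. The bot name SomeBot and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the rules from the first post (non-wildcard lines only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin
Disallow: /wp-includes
Allow: /wp-content/uploads

User-agent: Googlebot-Image
Disallow:
Allow: /*

User-agent: ia_archiver
Disallow: /

User-agent: duggmirror
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# (user agent, path) pairs to test; "SomeBot" stands in for any other crawler.
CHECKS = [
    ("Googlebot-Image", "/wp-content/uploads/photo.jpg"),
    ("Googlebot-Image", "/wp-admin/"),
    ("ia_archiver", "/"),
    ("duggmirror", "/some-post/"),
    ("SomeBot", "/wp-admin/"),
    ("SomeBot", "/some-post/"),
]

for agent, path in CHECKS:
    verdict = "allowed" if rules.can_fetch(agent, path) else "blocked"
    print(f"{agent:16} {path:32} {verdict}")
```

    A crawler with its own section follows only that section, which is why Googlebot-Image comes out as allowed even for /wp-admin/.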

    Hope that helps :).
     
    Chewyshoe, Nov 18, 2007
  3. sandrodz

    #3
    So basically, if I have only this, it's fine?

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: /category/*/*
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /*?*
    Disallow: /*?
    Allow: /wp-content/uploads
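
    The lines containing * rely on wildcard matching, which Google and some other crawlers support as an extension. To see what those lines would catch, here is a minimal Python sketch of that style of matching, assuming * means "any run of characters", a trailing $ anchors the end, and everything else is a prefix match. The function name and sample paths are invented for illustration; this is not any crawler's actual code.

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Return True if a wildcard-style robots.txt rule covers the path."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    pattern = "^" + body + ("$" if anchored else "")
    return re.match(pattern, path) is not None

SAMPLES = [
    ("/*?", "/some-post/?replytocom=12"),
    ("/*?*", "/some-post/?replytocom=12"),
    ("/*?", "/some-post/"),
    ("/category/*/*", "/category/photos/"),
    ("/category/*/*", "/category/photos"),
    ("*/feed", "/some-post/feed/"),
]

for rule, path in SAMPLES:
    print(f"{rule:16} {path:28} {rule_matches(rule, path)}")
```

    Under these assumptions /*?* and /*? cover the same URLs (anything with a query string), so one of the two lines is redundant.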
     
    sandrodz, Nov 18, 2007
  4. Sebastian

    #4
    I'd allow the category pages, but disallow the archives:
    Disallow: /2005/
    Disallow: /2006/
    Disallow: /2007/
    Disallow: /2008/
    Disallow: /2009/
    Disallow: /2010/
    Category pages can attract nice long-tail search queries; that's not likely to happen with monthly archives, which list posts from all categories.
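
    Rather than typing one line per year, the list can be generated. A small Python sketch, assuming date archives live under /YYYY/ and that post permalinks do not themselves start with the year (a prefix rule like /2007/ would block those too). The helper name archive_rules and the bot name SomeBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def archive_rules(first_year: int, last_year: int) -> list[str]:
    """Build one Disallow line per yearly archive, inclusive of both ends."""
    return [f"Disallow: /{year}/" for year in range(first_year, last_year + 1)]

lines = ["User-agent: *"] + archive_rules(2005, 2010)
print("\n".join(lines))

# Sanity check with the standard-library parser: archives blocked, categories not.
parser = RobotFileParser()
parser.parse(lines)
print(parser.can_fetch("SomeBot", "/2007/11/"))
print(parser.can_fetch("SomeBot", "/category/photos/"))
```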
     
    Sebastian, Nov 19, 2007
  5. SeoSmarty

    #5
    Why use robots.txt for that? In my experience, noindex is much better for avoiding the blog duplicate-content problem, and it's recommended by Googlers, by the way. There's a really good post on it at Sphinn, with a comment from Matt Cutts:

    sphinn.com/story/8667

    (look at comment #6)
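
    The noindex approach puts the decision in each page's HTML instead of in robots.txt. Here is a rough sketch of the logic in Python. A real WordPress theme would do this in its PHP header template, and the page-type names below are invented for illustration. Note that a crawler has to be able to fetch a page to see the tag, so pages marked this way should not also be disallowed in robots.txt.

```python
# Page types treated as duplicate listings (hypothetical names, adjust to taste).
NOINDEX_TYPES = {"date_archive", "search_results"}

def robots_meta(page_type: str) -> str:
    """Return the robots meta tag to place in the page's <head>."""
    if page_type in NOINDEX_TYPES:
        # Keep the page out of the index but still let its links be followed.
        return '<meta name="robots" content="noindex,follow">'
    return '<meta name="robots" content="index,follow">'

for page_type in ("single_post", "category", "date_archive"):
    print(page_type, robots_meta(page_type))
```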
     
    SeoSmarty, Nov 19, 2007