Disallow All Pages Except Homepage

Discussion in 'robots.txt' started by AzureHaze, Jul 22, 2009.

  1. #1
    Hi, can anyone help me out with this?

    How can I use robots.txt to let search engines index only the homepage and disallow/block all the other pages?

    I can't seem to find a proper answer for this anywhere.

    Thanks in advance.
     
    AzureHaze, Jul 22, 2009 IP
  2. dmi

    dmi Well-Known Member

    #2
    
    User-agent: * 
    Disallow: /
    Allow: /index.html
    
    Code (markup):
    Try that and let us know if it works.
     
    dmi, Jul 23, 2009 IP
  3. jabz.biz

    jabz.biz Active Member

    #3
    No, this should not work: index.html is in the root, which the slash already disallows, and Allow is not part of the original robots.txt standard, so not every crawler honors it. Also, this would make search engines rank the homepage as the file index.html rather than the root URL. Unless you rewrite the root to index.html anyway (via .htaccess, for example), this is confusing for search engine crawlers and bots.
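
    For crawlers that do support Allow plus wildcards and the end-of-URL anchor $ (Googlebot does; this is a Google extension, and bots that ignore it will simply fall back to the Disallow), a sketch of the usual "homepage only" pattern is:

    
    User-agent: *
    Allow: /$
    Disallow: /
    
    Code (markup):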

    Show me your website and I'll give you a solution.
     
    jabz.biz, Jul 28, 2009 IP
  4. AzureHaze

    AzureHaze Peon

    #4
    Thanks for the replies. I was actually testing out a WordPress theme on my personal blog, which I find is not very search engine friendly. It's not that big of a problem, though, because it's just my personal site and I don't put much content on it.

    I've used my robots.txt to block search engines from indexing the wp-content/themes directory, because the theme somehow doesn't point post pages to their original URLs but to certain URLs within the theme's directory instead.

    Here's the link to my site: Azure Haze
    Let me know if you have any ideas to make it more search engine friendly. By the way, the theme is Folio Elements from Press75.com.

    Thanks.
     
    AzureHaze, Jul 29, 2009 IP
  5. jabz.biz

    jabz.biz Active Member

    #5
    Everything but your homepage is in wp-content, so the robots.txt should look like this:

    
    User-agent: *
    Disallow: /wp*
    Disallow: /feed/
    
    Code (markup):
    This should allow indexing of your homepage but not the rest of the content.
     
    jabz.biz, Jul 30, 2009 IP
  6. AzureHaze

    AzureHaze Peon

    #6
    Thanks for the advice.

    So the asterisk (*) in /wp* will block every directory that starts with wp, including /wp-content, /wp-admin, and /wp-includes? Do crawlers other than Googlebot recognize this wildcard syntax?

    So far I don't see any major problems in my Google Webmasters account. I'll wait a few more days to see if there are any changes.

    One more question: if I'm not using the URL removal tool in Google Webmasters, will the old/unused pages that have been indexed disappear after a certain period of time?
     
    AzureHaze, Jul 30, 2009 IP
  7. jabz.biz

    jabz.biz Active Member

    #7
    After big spy Google picks up your new robots.txt, those pages should be de-indexed.
     
    jabz.biz, Jul 30, 2009 IP
  8. dmi

    dmi Well-Known Member

    #8
    They won't be de-indexed. Instructions in robots.txt stop robots from crawling further, but they don't tell search engines to de-index pages that are already indexed. A meta noindex tag is needed for that.
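
    For example, in the <head> of each page you want removed (the page has to stay crawlable in robots.txt, or the robot never sees the tag):

    
    <meta name="robots" content="noindex">
    
    Code (markup):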
     
    dmi, Aug 1, 2009 IP
  9. AzureHaze

    AzureHaze Peon

    #9
    Thanks for the advice. To be more specific, I meant pages that don't exist anymore: do they get de-indexed after a period of time?
     
    AzureHaze, Aug 1, 2009 IP
  10. jabz.biz

    jabz.biz Active Member

    #10
    Pages that do not exist anymore need to give the search engine crawler a 404 error page. You can solve that problem using .htaccess. If you put your error pages in a folder (e.g. /404/), then you need to add something like this to your .htaccess file:
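
    A minimal sketch, assuming the error page lives at /404/index.html:

    
    ErrorDocument 404 /404/index.html
    
    Code (markup):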

    After the search engines have picked that up, the non-existent pages will stop appearing in the SERPs. Be patient. ;)
     
    Last edited: Aug 5, 2009
    jabz.biz, Aug 5, 2009 IP
  11. AzureHaze

    AzureHaze Peon

    #11
    I used .htaccess to redirect my 404s to my homepage. Is it okay to do so? Does it make any difference if I direct them to a custom 404 page instead?

    Thanks for the advice.
     
    AzureHaze, Aug 5, 2009 IP
  12. premiumscripts

    premiumscripts Peon

    #12
    Well, the user experience will obviously be better if you show them a custom 404 page.
     
    premiumscripts, Aug 5, 2009 IP
  13. Professional Dude

    Professional Dude Prominent Member

    #13
    I have a similar question: how can I remove all pages except the homepage? It's not a WordPress site, otherwise I would have used the code by jabz.biz.

    Any ideas?
     
    Professional Dude, Aug 23, 2009 IP
  14. Exa

    Exa Active Member

    #14
    What he did was disallow the directories and other files one by one. So if your root directory contains something like

    folder1/
    folder2/
    test/
    index.html
    anotherpage.html
    
    Code (markup):
    you should enter something like this:
    User-agent: *
    Disallow: /folder*
    Disallow: /test/
    Disallow: /anotherpage.html
    Code (markup):
     
    Exa, Sep 1, 2009 IP
  15. jabz.biz

    jabz.biz Active Member

    #15
    No, this way a search engine does not understand that the page does not exist anymore. Set up a 404 error page and offer users some links to help them find what they are looking for.
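
    In .htaccess terms, that means an ErrorDocument instead of a redirect. A sketch (www.example.com stands in for your domain):

    
    # redirecting missing pages to the homepage sends a 302, not a 404 ("soft 404"):
    # ErrorDocument 404 http://www.example.com/
    
    # serving a local error page keeps the real 404 status:
    ErrorDocument 404 /404.html
    
    Code (markup):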
     
    jabz.biz, Sep 8, 2009 IP