
Should I deny googlebot access to my site's forum?

Discussion in 'robots.txt' started by Yukio, Jun 21, 2005.

  1. #1
    Should I deny googlebot access to forum threads?

    I'm thinking that if I don't, they'll spend more time indexing forum threads than the site pages themselves (I ran a search engine spider script on my site and it started indexing the forum first and took forever).

    The forum has about 300,000 threads.
     
    Yukio, Jun 21, 2005 IP
  2. T0PS3O

    T0PS3O Feel Good PLC

    #2
    No. Just make sure the forum links back to the site and they'll end up there too. Your main site is likely to have more IBLs anyway, so it's more likely to get spidered more often.

    You can also try their Sitemap feature.
     
    T0PS3O, Jun 21, 2005 IP
  3. Yukio

    Yukio Peon

    #3
    Alrightey ^_^ Thanks for letting me know. I'll remove the robots.txt file.
     
    Yukio, Jun 21, 2005 IP
  4. wrmineo

    wrmineo Peon

    #4
    Don't remove the robots.txt file!! This will prevent bots from being able to spider your site at all; modify to suit your needs, but do not remove.
     
    wrmineo, Jun 21, 2005 IP
  5. yfs1

    yfs1 User Title Not Found

    #5
    I would respectfully disagree. Not a single one of my sites has a robots file and they are all fully indexed.

    For example:
    One of my sites
    http://www.google.ie/search?hl=en&q=site:www.yourfavouriteshop.com&meta=

    That site has never had a robots file ;)

    I routinely advise newbies not to use one, as they always seem to mess it up and cause problems with indexing. Make sure you understand the syntax before you put one up.
     
    yfs1, Jun 21, 2005 IP
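    A quick way to act on that syntax warning: Python's standard library ships a robots.txt parser (urllib.robotparser), so you can sanity-check a draft before uploading it. The rules and URLs below are made-up examples for illustration, not anything from this thread:

```python
# Sketch: sanity-check a robots.txt draft with Python's standard-library
# parser before putting it live. The /admin/ rule and example.com URLs
# are hypothetical.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-formed file should block /admin/ but leave everything else open.
print(rp.can_fetch("*", "http://example.com/admin/users"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))   # True
```

    If the second call comes back False, you've accidentally blocked your whole site, which is exactly the kind of mess-up described above.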
  6. wrmineo

    wrmineo Peon

    #6
    Is it safe to assume that you also have a plethora of other sites linking to yours?

    I made the comment because the "new" MSN, and some others, seem "resistant" to crawling new sites, and having a robots.txt file seems to help.

    I too have seen many sites well indexed in all the major engines without a robots file, but many of those have been well established for some time.

    For my own, and maybe others', education, what are some of the potential pitfalls of having the robots.txt file?

    Thanks!
     
    wrmineo, Jun 21, 2005 IP
  7. pwaring

    pwaring Well-Known Member

    #7
    I don't think there are any pitfalls to having or not having a robots.txt file. Some of my sites have them and some don't, and I've never noticed a change when I've added or deleted one. I tend to add them to most of my sites nowadays, though, otherwise I end up with loads of 404 errors cluttering my logs from all the search engines requesting the file by default; the same goes for favicon.ico.
     
    pwaring, Jun 21, 2005 IP
  8. Josh

    Josh Peon

    #8
    Removing the robots.txt file has no effect on the indexing of your site. None of my sites have a robots.txt file, and they've all been spidered by Google and every other large SE.


    Josh
     
    Josh, Jun 21, 2005 IP
  9. noppid

    noppid gunnin' for the quota

    #9
    I guess we can put the "it's too hard to make a default file" argument to bed easily.

    I have seen what I believe to be better attention from having one. Maybe it's just perception, but I have a few examples, which I won't share, that are my proof.

    Anywho, if ya want one, use this and everything will be allowed to be spidered...

    
    User-agent: *
    Disallow:
    
    
    Code (markup):
    Save that as robots.txt and put it in your root folder. Note that there is a newline/return after Disallow:.

    Now no matter which you prefer, you can do it right. :)
     
    noppid, Jun 21, 2005 IP
  10. Josh

    Josh Peon

    #10
    But no robots.txt file = disallow none, so why create something that sets a preference that's already set by default?


    Josh
     
    Josh, Jun 21, 2005 IP
  11. minstrel

    minstrel Illustrious Member

    #11
    Because whether or not you have one, Googlebot will check for it?

    The only downside is making a syntax error -- do what noppid says exactly in a text editor and you'll be fine.

    The original question, however, asked about a forum -- you can disallow certain files and folders in the forum to keep Googlebot on the actual post pages and out of things like the member list, the admin area, and "process" files (posting, etc.).

    For example, a robots.txt file for a phpBB forum might look like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /media/
    Disallow: /misc/
    Disallow: /stats/
    Disallow: /phpbb/admin/ 
    Disallow: /phpbb/db/ 
    Disallow: /phpbb/images/ 
    Disallow: /phpbb/includes/ 
    Disallow: /phpbb/language/ 
    Disallow: /phpbb/profile.php 
    Disallow: /phpbb/groupcp.php 
    Disallow: /phpbb/memberlist.php 
    Disallow: /phpbb/login.php 
    Disallow: /phpbb/modcp.php 
    Disallow: /phpbb/posting.php 
    Disallow: /phpbb/privmsg.php 
    Disallow: /phpbb/search.php 
    
    Code (markup):
     
    minstrel, Jun 21, 2005 IP
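    You can verify that rules like the phpBB example above do what you intend with Python's standard-library urllib.robotparser. The /phpbb/ path matches the example; the example.com URLs are hypothetical, so adjust them for your own install:

```python
# Sketch: confirm that forum "process" pages are blocked while thread
# pages stay crawlable. A trimmed subset of the phpBB rules above;
# example.com is a placeholder.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /phpbb/memberlist.php
Disallow: /phpbb/profile.php
Disallow: /phpbb/posting.php
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Utility pages are blocked...
print(rp.can_fetch("*", "http://example.com/phpbb/memberlist.php"))    # False
# ...but the actual thread pages stay open to the spiders.
print(rp.can_fetch("*", "http://example.com/phpbb/viewtopic.php?t=1"))  # True
```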
  12. yfs1

    yfs1 User Title Not Found

    #12
    I agree about the syntax error risk. The 404s have no effect on SEO, but they can be annoying for some people.

    It's still my opinion that that's the only reason you'd need a robots file if you want your whole site indexed.
     
    yfs1, Jun 22, 2005 IP
  13. pwaring

    pwaring Well-Known Member

    #13
    As I said earlier in the thread, all the reputable search engines will check for its existence anyway, so if you don't have one, you'll end up with loads of 404 errors clogging up your logs. Creating a "default" robots.txt (even just a blank file) stops this.
     
    pwaring, Jun 22, 2005 IP
  14. ziandra

    ziandra Well-Known Member

    #14
    Me too. I don't like seeing 404 errors in awstats; it makes me go look in the logs to see what link I broke. Besides, it makes me feel good to explicitly invite the robots in rather than just leaving the door open and implying that it's OK.
     
    ziandra, Jun 26, 2005 IP
  15. exam

    exam Peon

    #15
    You can also have an empty file called robots.txt and it won't affect spidering, but depending on the size of your error doc, you could save a little bit of bandwidth.
     
    exam, Jun 26, 2005 IP
  16. minstrel

    minstrel Illustrious Member

    #16
    I really wouldn't advise using an empty robots.txt file -- that may confuse some spiders -- either omit it altogether, which spiders know how to handle, or include the all-purpose "go ahead and spider everything" version posted earlier:

    User-agent: *
    Disallow: 
    
    
    Code (markup):
     
    minstrel, Jun 26, 2005 IP
  17. pwaring

    pwaring Well-Known Member

    #17
    Why would it confuse spiders any more than a lack of a robots.txt file? They'll download it, read in the rules and default to "allowed to crawl anything" if there's nothing there. I very much doubt that any of the major search engines would choke on an empty file, and any robot that does is probably poorly written anyway.
     
    pwaring, Jun 26, 2005 IP
  18. minstrel

    minstrel Illustrious Member

    #18
    Have you looked at the code for all spiders?

    I'm not certain that it WOULD confuse them - I'd be concerned that it MIGHT, as I said.

    I also do not see the point of an empty robots.txt file. Why even have it? We know that spiders can handle having NO robots.txt file just fine, so there is zero risk in not having one other than the 404 log lines. If you are going to bother creating and uploading a robots.txt file at all (which I do recommend), why not put those two lines in it (see above)?

    Are you that short of server space?
     
    minstrel, Jun 27, 2005 IP
  19. exam

    exam Peon

    #19
    From robotstxt.org: http://www.robotstxt.org/wc/exclusion-admin.html
    From Google: http://www.google.com/bot.html#norobots
     
    exam, Jun 27, 2005 IP
  20. exam

    exam Peon

    #20
    Depending on your server's 404 error page, you could save a *little* bandwidth by using an empty robots.txt file. (The error page might be 1-2k, but an empty robots.txt file will be 0.) Also, I have sites with empty robots.txt files that are indexed by G, Y & MSN and others, so I think that's evidence enough that there's not a problem. (At least with the robots that matter) :) If you want to type in the two lines to allow access to everything, go right ahead, or take the lazy man's way out and upload a blank file. It does the same thing.
     
    exam, Jun 27, 2005 IP
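    For what it's worth, Python's standard-library parser backs up the "blank file does the same thing" point: it treats an empty robots.txt exactly like the explicit allow-all version. The example.com URL below is a made-up placeholder:

```python
# Sketch: an empty robots.txt and the explicit allow-all file produce the
# same answer from Python's standard-library parser.
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str) -> bool:
    # Parse the given robots.txt text and ask whether any bot may fetch url.
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", url)

url = "http://example.com/anything.html"
print(allowed("", url))                            # True: empty file
print(allowed("User-agent: *\nDisallow:\n", url))  # True: explicit allow-all
```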