I'm running the latest phpBB forum. Which parts should I list in robots.txt so they don't get spidered? I don't want all that junk in the index, just the useful posts. If anyone has an actual example of a robots.txt that does this, that would be great.
Here's one of mine:

User-agent: *
Disallow: /forums/admin/
Disallow: /forums/images/
Disallow: /forums/includes/
Disallow: /forums/language/
Disallow: /forums/templates/
Disallow: /forums/common.php
Disallow: /forums/config.php
Disallow: /forums/groupcp.php
Disallow: /forums/memberlist.php
Disallow: /forums/modcp.php
Disallow: /forums/posting.php
Disallow: /forums/profile.php
Disallow: /forums/privmsg.php
Disallow: /forums/viewonline.php
Disallow: /forums/search.php
Disallow: /forums/faq.php
robots.txt is great for controlling what bots index on your site, especially if you have duplicate content elsewhere on the web. Duplicate content is bad, bad, bad!
Mine is more compact:

User-agent: *
Disallow: /forum/admin/
Disallow: /forum/includes/
Disallow: /forum/common.php
Disallow: /forum/config.php
Disallow: /forum/groupcp.php
Disallow: /forum/memberlist.php
Disallow: /forum/modcp.php
Disallow: /forum/profile.php
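A rule set like the ones above can be sanity-checked before it goes live. Here is a minimal sketch using Python's standard `urllib.robotparser`, assuming the forum lives under /forum/ as in the compact example. The sample URLs are made up for illustration, and Python's parser does plain prefix matching, so treat the result as a rough check rather than a statement about how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the compact example above.
RULES = """\
User-agent: *
Disallow: /forum/admin/
Disallow: /forum/includes/
Disallow: /forum/common.php
Disallow: /forum/config.php
Disallow: /forum/groupcp.php
Disallow: /forum/memberlist.php
Disallow: /forum/modcp.php
Disallow: /forum/profile.php
"""


def build_parser(rules: str) -> RobotFileParser:
    """Load robots.txt text into a parser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(RULES)

# Hypothetical URLs, chosen only to exercise the rules.
SAMPLE_PATHS = (
    "/forum/profile.php?mode=viewprofile&u=2",
    "/forum/viewtopic.php?t=45",
    "/forum/admin/index.php",
)

for path in SAMPLE_PATHS:
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(f"{path}: {verdict}")
```

Because matching is by prefix, a rule such as `Disallow: /forum/profile.php` also covers every query-string variant of that script.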
Sorry for the ancient bump, but how would I go about blocking the "viewtopic.php?p=<postnumber>" links? Mainly to avoid duplicate content penalties.
There is no duplicate content penalty. There is a duplicate content filter which will index one page and ignore other pages with identical content. Unless Google indexes the "wrong" page, you don't need to worry. How does "viewtopic.php?p=<postnumber>" create a duplicate content issue?
A topic is usually "viewtopic.php?t=<threadID>". However, each individual post on that page (the first page of the thread) can also be linked to using "viewtopic.php?p=<postID>#<postID>". Google of course ignores the #postID, which only points to a specific area of the page. So with probably 20 posts per page, there's no content difference (at ALL) between the post pages and the thread page itself.
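The point about the #postID can be seen by splitting the URL: the fragment after # is never sent to the server, so a crawler is left with only the query string. A small Python sketch, using made-up IDs:

```python
from urllib.parse import parse_qs, urldefrag, urlsplit

# Hypothetical IDs, for illustration only.
post_link = "viewtopic.php?p=123#123"
topic_link = "viewtopic.php?t=45"

# Separate the part that is actually requested from the in-page anchor.
fetched_url, fragment = urldefrag(post_link)

print(fetched_url)                            # viewtopic.php?p=123
print(fragment)                               # 123
print(parse_qs(urlsplit(fetched_url).query))  # {'p': ['123']}
print(parse_qs(urlsplit(topic_link).query))   # {'t': ['45']}
```

The two links differ only in which query parameter they carry (`p` versus `t`), which is why both can lead to the same page of posts.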
I think you're worrying way too much about the duplicate content issue here. Remember, it's a filter, not a penalty. You don't really want to eliminate the individual post URLs, since other sites will legitimately want to link to either a specific thread OR to a specific post. And you don't NEED to eliminate individual post URLs: all links will take you to the thread one way or another, which is after all the whole idea.
Do not forget this one: It's probably the most important rule, actually, since the same message can be reached through both /forum/viewtopic.php?p=<id> and /forum/viewtopic.php?t=<id>.
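The post above does not show the rule itself. Going by the description, it would presumably be a Disallow on the viewtopic.php?p= prefix, but that exact line is an assumption here, as is the /forum/ path. A minimal Python sketch of what such a rule would and would not block, using the standard `urllib.robotparser` and made-up IDs:

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: the rule text below is a guess based on the description
# in the post; the original rule is not shown in the thread.
ASSUMED_RULES = [
    "User-agent: *",
    "Disallow: /forum/viewtopic.php?p=",
]

post_parser = RobotFileParser()
post_parser.parse(ASSUMED_RULES)


def is_blocked(path: str) -> bool:
    """True if the assumed rules stop a generic crawler from fetching path."""
    return not post_parser.can_fetch("*", path)


# Hypothetical post and thread IDs.
for path in ("/forum/viewtopic.php?p=123", "/forum/viewtopic.php?t=45"):
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Under that assumption, per-post links are excluded while thread links stay crawlable. Whether excluding them is worth doing at all is the question debated earlier in the thread.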