I have a site with about 3500 pages indexed, but 3400 of them are supplemental. What I would like to do is block (with robots.txt) about 2000 product pages that are not the best quality (short descriptions, etc.). It is impossible to block these pages with a single line such as "Disallow: /catalog/" because I have 600 products that I want to stay allowed. Could I put all of the product URLs that I want to disallow into robots.txt? I don't know whether Google could handle a robots.txt that big. Has anyone tried this?
I didn't think about that!! The only problem is that it will take a while to compile the list, and if it doesn't work it would be a waste of time. I might try it though.
Add 600 random nonexistent addresses/directories first. It would take about 5 minutes to write a PHP program that would spit those out, then just view source and copy and paste into robots.txt. If it is going to balk at reading the file, it won't matter whether they actually exist or not. -Michael
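A rough sketch of the kind of throwaway generator Michael describes, written in Python rather than PHP. The "/catalog/" prefix comes from the first post; the function names, the 12-character random slugs and the ".html" suffix are made up for illustration and are not real osCommerce URLs.

```python
import random
import string


def make_dummy_disallow_lines(count=600, prefix="/catalog/", seed=0):
    """Return `count` unique Disallow lines pointing at made-up paths.

    The paths are random and are not expected to exist on the site; the
    point is only to see whether a long robots.txt gets read at all.
    """
    rng = random.Random(seed)  # fixed seed so the output is repeatable
    alphabet = string.ascii_lowercase + string.digits
    paths = set()
    while len(paths) < count:
        slug = "".join(rng.choices(alphabet, k=12))
        paths.add(f"{prefix}{slug}.html")
    return [f"Disallow: {path}" for path in sorted(paths)]


def build_robots_txt(disallow_lines):
    """Put the Disallow lines under a single catch-all User-agent record."""
    return "\n".join(["User-agent: *", *disallow_lines]) + "\n"


if __name__ == "__main__":
    # Redirect this output to a file and upload it as robots.txt.
    print(build_robots_txt(make_dummy_disallow_lines()), end="")
```

Changing `count` to 2000 would give a file closer in size to the real list from the first post.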
Well, they have a limit of 5000 characters on the Webmaster tool, and I have done a little searching on Google and most people say to keep the file below 15k. So if anyone has any ideas I would like to hear them. The product pages are osCommerce with mod_rewrite, in case that helps.
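To get a feel for those two numbers before compiling the real list, a quick size check can be sketched like this. The 5000-character and 15k figures are the ones quoted in the post above, not verified limits. The function name and the "/catalog/product-0001.html" style paths are placeholders, not actual URLs from the site.

```python
def robots_txt_size_report(text, char_limit=5000, byte_limit=15 * 1024):
    """Compare a robots.txt body against the two thresholds from the thread.

    char_limit: the Webmaster tool limit mentioned in the post (assumed).
    byte_limit: the "keep it below 15k" rule of thumb (assumed to mean bytes).
    """
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {
        "chars": n_chars,
        "bytes": n_bytes,
        "fits_webmaster_tool": n_chars <= char_limit,
        "under_15k": n_bytes < byte_limit,
    }


def sample_robots_txt(n_products=2000):
    """Build a stand-in file with one Disallow line per placeholder product."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: /catalog/product-{i:04d}.html" for i in range(1, n_products + 1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(robots_txt_size_report(sample_robots_txt()))
```

With 2000 placeholder paths of this length the file comes to about 74,000 characters, well past both figures. Real product URLs may be shorter or longer.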
If you're just trying to funnel PageRank only to the important pages, then why don't you use nofollow on the ones you don't want PageRank to go to? Alternatively, you could just boost the sections that are in the supps, and get them into the regular index. -Michael
Just curious... why do you want to block these pages? Supplemental results are not something to be afraid of...
It is well known that if you block the unimportant pages, the rankings will go up on your important pages that are not blocked, because they get more link power. Check out www.seobook.com/archives/001545.shtml
It's dead weight... the PageRank a page has that can be passed on to other pages is divided among all links on a page. If some of the links point to supps, which won't then pass any on themselves to other pages of the site, then it's wasted. However, I think you need to focus on the linking, rather than on whether or not the pages are in the index. I'm pretty sure that blocking a page with robots.txt does not stop Google from assigning a portion of the PageRank to the links pointing to those pages. -Michael
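The "divided among all links" point can be put into numbers with a toy calculation. This is a deliberately simplified model that assumes an even split across links. It is not Google's actual formula, and the function names and figures are invented for illustration.

```python
def share_per_link(passable_rank, n_links):
    """Even split of a page's passable PageRank across its outgoing links."""
    if n_links <= 0:
        raise ValueError("a page needs at least one link to pass anything on")
    return passable_rank / n_links


def wasted_share(passable_rank, n_links, n_dead_links):
    """Portion that flows to links whose targets pass nothing further on."""
    if not 0 <= n_dead_links <= n_links:
        raise ValueError("n_dead_links must be between 0 and n_links")
    return share_per_link(passable_rank, n_links) * n_dead_links


if __name__ == "__main__":
    # Hypothetical page: 10 links, 4 of them pointing at supplemental pages.
    print(share_per_link(1.0, 10))
    print(wasted_share(1.0, 10, 4))
```

Under this model, leaving a page out of the index changes nothing by itself. The share only moves when the links on the page change, which is why the linking matters more here.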