I have a folder on one of my sites where I try to sell digital downloads. It holds about 1,000 sample pages. In robots.txt, I stopped Google from indexing it, because: 1. Every sample page in that directory is basically the same template for a different sample. The only thing that changes is a 3-4 sentence description of the sample. 2. Every page has an outgoing link to a different domain where visitors can purchase the download. I didn't want this to be seen as excessive cross-linking (that would be 1,000 outgoing links to the same site). 3. The samples are all images, so there is really little content. Is blocking these pages from being indexed the right move?
Yes, IMO you should disallow the entire folder. Remember, disallowing really means 'do not index', so I wouldn't try to hide any text, links, spam or otherwise within those pages. Also, I would make sure the page's robots meta tag reads NOINDEX, NOFOLLOW. That leaves you squeaky clean.
A block in robots.txt can keep Googlebot from seeing the pages, but Google will still crawl them as part of its performance and anti-fraud checks.
Actually, disallowing means "Do not crawl"... That is MUCH different from "Do not index". Pages that have been disallowed can still show up in Google's index and search results if enough other sites link to them and the link text makes Google feel they are a relevant result. The best way to prevent indexing is <meta name="robots" content="noindex">.
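To make the crawl/index distinction concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The `/samples/` folder name and the example.com URLs are assumptions for illustration, not the original poster's actual paths. It only shows what a Disallow rule governs: whether a compliant crawler may fetch a URL. It says nothing about whether that URL can still end up in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "/samples/" stands in for the real folder name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /samples/
"""


def can_crawl(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A page inside the disallowed folder versus one outside it.
    print(can_crawl("https://example.com/samples/page-1.html"))
    print(can_crawl("https://example.com/about.html"))
```

One consequence worth keeping in mind: a crawler that obeys the Disallow never fetches the page, so it never reads a meta tag placed on that page.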
Just wanted to follow up on this one. We had the correct answer, but we want to be sure we have the whole answer. Stopping Google from indexing and following a page in robots.txt is not enough. You also need to add nofollow meta tags on the pages themselves to completely stop any PageRank ("PR juice") loss from the pages that link to them.
Disallow the folder in the robots.txt file (Google caches robots.txt and refreshes it about every 24 hours, so be sure to add the rule a day before you make the content live). If the content is already live, also add NOINDEX and NOFOLLOW to the robots meta tag. This will keep Googlebot from indexing the pages AND following the links on them: <meta name="robots" content="NOINDEX,NOFOLLOW">
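With around 1,000 templated pages, it may be worth checking that the tag actually made it onto each one. Below is a minimal sketch using Python's standard `html.parser` that reads back the robots directives from a page's HTML. The class and function names are made up for illustration, and fetching the pages is left out.

```python
from html.parser import HTMLParser
from typing import Set


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content value is a comma-separated list such as "NOINDEX,NOFOLLOW".
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> Set[str]:
    """Return the lowercased robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="NOINDEX,NOFOLLOW"></head></html>'
    found = robots_directives(page)
    print("noindex" in found and "nofollow" in found)
```

Directives are lowercased before comparison, so `NOINDEX` and `noindex` are treated as the same value.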