Hi, I have an affiliate store as part of my site. If I let Googlebot crawl it, it will expose thousands of very similar pages, and Google will probably never cache most of them. Is it better to just tell Google in robots.txt not to crawl that portion of my website? I know I won't rank for anything in my store if I do that, but at least Google won't think of my site as complete spam? What do you think?
It is indeed sad that we have to take precautionary measures against a search engine indexing ... but you're right that once you get a duplicate-content penalty it's difficult to come out from under it. If you're flagged as spammy, you may also be looking at a reinclusion request. It's easier all around to ban Googlebot from that section via robots.txt or meta tags: ban indexing, ban caching, and set up a honeypot for the scrapers who could cause you problems by reproducing those banned pages.
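If it helps, here's a minimal sketch. Assuming your store lives under /store/ (that path is just a placeholder, swap in your real one), the robots.txt rule looks like this:

    User-agent: Googlebot
    Disallow: /store/

One caveat worth knowing: robots.txt only blocks crawling, not indexing, so a blocked URL can still show up as a bare listing if other sites link to it. To ban indexing and the cache link outright, put the meta tag on each store page instead:

    <meta name="robots" content="noindex, noarchive">

Don't combine the two on the same URLs, though: if robots.txt blocks a page, Googlebot never fetches it and so never sees the meta tag. As for the honeypot, one common setup is to disallow a trap directory for all bots,

    User-agent: *
    Disallow: /trap/

then link to /trap/ somewhere invisible to human visitors. Legit bots obey the disallow, so anything that fetches /trap/ is a scraper you can log and block.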