There are times when it is useful to provide two or more separate views of the same information to make things easier on the end user. For instance, if you have a large help file, you might want an all-in-one-page HTML view, a multi-page HTML view, a PDF version, a FlashPaper version, and a text file. For the HTML views, you might want a cool Ajax interface, a minimal JavaScript navigation interface, and an accessible version. The list goes on.
In these situations, is there a generally accepted consensus on best practices for search engine access? Is there an existing thread somewhere where this has been debated? Or, heaven forbid, have any of the search engines provided a guideline?
Just thinking out loud here. The Google Sitemaps protocol has a priority setting, so you could raise the priority of the pages in the multi-page HTML document and lower the priority of all the others, but that seems a little off, and it is only a one-search-engine solution.
Maybe place alternative views of documents inside a no-index directory? That works if you only have a couple of documents, but if your document repository is large it becomes unreasonably difficult to manage, and finding documentation in the desired format is not intuitive for the end user.
Another alternative would be to include noindex, nofollow meta tags, but then you run into a problem with documentation that is not wrapped in an HTML page. It would work for FlashPaper, but not for a PDF, a CHM file, a text file, or even a Word document.
Another solution would be JavaScript-built links to the alternative formats, which would hide the links from the search engines. But the links would then be invisible to browsers that don't support JavaScript and to users who have it disabled, and again the links would be difficult for the end user to discern.
The last solution I can think of is a download form that links to the multi-page HTML version and makes you select a format and hit download to get any other version. That is not elegant at all, and users could not link directly to the documentation, which might be good for Google but is a potential annoyance to the end user community.
The best solution is to just let the duplicate content filters do the job they are intended to do. Google, Yahoo and MSN all have filters that find the best content for a particular search query and display that to the searcher. Forget about the search engines and let them choose which version to show your visitors. If you have the structure of your site right, that will be the standard HTML version, with maybe an indented result for the PDF version.
I actually had a chance to talk with a Google rep and a few other people during the SES conference. I've been told the following:
- Different formats of the same document are not much of a worry: a PDF, a Word document, and an HTML page covering the same material are fine.
- From a sitemap perspective, the HTML document should be given a higher priority than the others.
- Encourage people to link to the HTML version rather than the other formats.
- Only link to the alternate formats from the HTML version.
- link rel=alternate is encouraged.
- Ideally, put the alternate formats in an unspiderable directory.
A lot of common sense, but it was good to talk to some pros who have dealt with this kind of issue before.
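For anyone who wants to try the mechanical parts of that advice, here is a minimal Python sketch. It is my own illustration, not something the Google rep provided. It builds a sitemap that gives the HTML pages a higher priority, generates link rel=alternate tags for the HTML head, and checks that a robots.txt rule really keeps spiders out of the alternate-format directory. The example.com URLs, the /alt/ directory, the 0.8 and 0.2 priority values, and all function names are assumptions made up for the example; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical site layout: HTML pages under /help/, other formats under /alt/.
BASE = "http://www.example.com"
HTML_PAGES = ["/help/index.html", "/help/install.html"]
ALT_FORMATS = {
    "/alt/help.pdf": "application/pdf",
    "/alt/help.txt": "text/plain",
}


def build_sitemap(html_pages, alt_paths, base=BASE):
    """Return sitemap XML in which HTML pages outrank the other formats."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # Priority runs from 0.0 to 1.0; 0.8 and 0.2 are arbitrary example values.
    entries = [(p, "0.8") for p in html_pages] + [(p, "0.2") for p in alt_paths]
    for path, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base + path
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")


def alternate_links(alt_formats, base=BASE):
    """Return <link rel="alternate"> tags to place in the HTML page's <head>."""
    return [
        f'<link rel="alternate" type="{mime}" href="{base}{path}">'
        for path, mime in alt_formats.items()
    ]


def robots_txt(blocked_dir):
    """Return a robots.txt body that asks all crawlers to skip one directory."""
    return f"User-agent: *\nDisallow: {blocked_dir}\n"


def is_crawlable(robots, url, agent="*"):
    """Check a URL against robots.txt text using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots.splitlines())
    return parser.can_fetch(agent, url)
```

If the alternate formats really do sit in a directory blocked by robots.txt, it probably makes sense to leave them out of the sitemap altogether by passing an empty list as alt_paths. The low-priority entries are for cases where blocking the directory is not practical.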