If conversion isn't possible, then at the very least I need a way to have PHP read all the text in these documents. I've been looking for a simple I/O API to convert some or all of these files into HTML so they can be read and indexed for my latest project. I need members to be able to search within documents I've uploaded in these formats. I see Google Docs has an API, but the documentation is quite limited IMO. I don't need a one-API-fits-all-file-types solution, but I do need a solution for all of these formats, even if it's one API per file type. I've thought about waiting for Google to cache the document and then accessing the "view in HTML" cached version, but it'd be nice to be independent of all that nonsense. For this project, automation is key! There could potentially be hundreds (or even thousands) of documents a day flowing through the site, and that cannot be handled manually. Any suggestions?
Adobe has an online PDF-to-HTML tool; you could probably automate it: http://www.adobe.com/products/acrobat/access_onlinetools.html
Thanks, but I think I have PDF covered - http://www.setasign.de/products/pdf-php-solutions/fpdi/demos/simple-demo/ The Office 2003 and Office 2007 formats are the ones I'm most concerned about. I know I can get at the text of the Office 2007 files by unzipping the docx (or xlsx, etc.), but it's tough to eliminate all the unnecessary styling / formatting XML. Realizing this just now, I think I have at least those under control with PHP's zip_open()! What is the format for the old Office extensions?
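For anyone following along, here is a rough sketch of the unzip-and-strip idea for docx. It uses ZipArchive rather than zip_open() (the procedural zip functions are deprecated as of PHP 8), and it relies on the body of a docx living in word/document.xml inside the archive. The function names wordXmlToText() and docxToText() are made up for this example, and the approach of keeping only the w:t text runs of each w:p paragraph is just one way to drop the styling XML, not a complete converter.

```php
<?php
// Namespace URI used by WordprocessingML elements such as <w:p> and <w:t>.
const WORD_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: reduce the XML of word/document.xml to plain text,
// one line per paragraph, ignoring all styling/formatting elements.
function wordXmlToText(string $xml): string
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        throw new RuntimeException('Could not parse document XML');
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', WORD_NS);

    $lines = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        // Only <w:t> elements carry visible text; everything else is markup.
        $text = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $run) {
            $text .= $run->textContent;
        }
        $lines[] = $text;
    }

    return implode("\n", $lines);
}

// Hypothetical helper: open a .docx as a zip archive and extract its text.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }

    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    if ($xml === false) {
        throw new RuntimeException("word/document.xml not found in $path");
    }

    return wordXmlToText($xml);
}
```

Calling docxToText() on an uploaded file's path should return text that can be passed straight to an indexer. The other zipped Office formats keep their text in different parts of the archive (xlsx, for instance, stores cell strings in xl/sharedStrings.xml), so each would need its own part name and element names; treat those as details to check against a real file.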