Convert PDF, DOCX, XLSX, PPTX, PPT, DOC, XLS, PUB, ODP, OPS, OPT, ODF to HTML in PHP

tguillea Active Member

#1

If conversion isn't possible, then at the very least I need a way to have PHP read all the text in these documents.

I've been looking for some kind of simple I/O API to convert some or all of these files into HTML to be read and indexed for my latest project.

I need members to be able to search within documents I've uploaded in these formats. I see Google Docs has an API, but the documentation is quite limited IMO.

I don't need a one-API-fits-all-filetypes solution, but I do need a solution for all these formats, even if it's one API for each file type. I've thought about waiting for Google to cache the document and then accessing the "view in HTML" cached version, but it'd be nice to have it independent of all that nonsense.

For this project, automation is key! There could potentially be hundreds (or even thousands) of documents a day flowing through the site, and it cannot be done manually.

Any suggestions?

tguillea, Apr 24, 2010

Brad33 Peon

#2

Adobe has an online PDF-to-HTML tool; you could probably automate it...

http://www.adobe.com/products/acrobat/access_onlinetools.html

Brad33, Apr 24, 2010

tguillea Active Member

#3

Thanks, but I think I have PDF covered - http://www.setasign.de/products/pdf-php-solutions/fpdi/demos/simple-demo/

The Office 2003 and Office 2007 formats are the ones I'm most concerned about. I know I can get at the text of the Office 2007 files by unzipping the docx (or xlsx, etc.), but it's tough to strip out all the unnecessary styling / formatting XML.

Having just realized this, I think I have at least the zipped Office 2007 formats under control with PHP's zip_open()!
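Here is a minimal sketch of that unzip approach for docx, assuming the zip and dom extensions are enabled. It uses ZipArchive rather than zip_open(), since the zip_* functions are deprecated as of PHP 8.0. The function names docxXmlToText() and extractDocxText() are made up for illustration. Instead of stripping the styling XML by hand, it keeps only the text runs (w:t elements) inside each paragraph (w:p) and ignores everything else.

```php
<?php
// Sketch only: helper names are hypothetical, not part of any library.

/**
 * Convert the contents of word/document.xml into plain text,
 * one line per paragraph. Returns '' if the XML cannot be parsed.
 */
function docxXmlToText(string $xml): string
{
    $dom = new DOMDocument();
    if (!@$dom->loadXML($xml)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    // Main WordprocessingML namespace used by docx files.
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $lines = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        $line = '';
        // Only w:t nodes carry visible text; styling nodes are skipped.
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $line .= $textNode->textContent;
        }
        $lines[] = $line;
    }

    return implode("\n", $lines);
}

/**
 * Open a .docx as a zip archive and return its text,
 * or null if the file cannot be opened or has no document.xml.
 */
function extractDocxText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }

    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? null : docxXmlToText($xml);
}
```

Calling extractDocxText('upload.docx') (a placeholder file name) should give you text you can feed to an indexer. Text boxes nested inside a paragraph may show up twice with this simple query, so treat it as a starting point. The other zipped formats keep their text in different parts of the archive, so each one would need its own variant of the XPath query.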

What format do the old Office extensions (DOC, XLS, PPT) use?

tguillea, Apr 25, 2010
