Convert PDF, DOCX, XLSX, PPTX, PPT, DOC, XLS, PUB, ODP, ODS, ODT, ODF to HTML in PHP

Discussion in 'PHP' started by tguillea, Apr 24, 2010.

  1. #1
    If conversion isn't possible, then at the very least I need a way to have PHP read all the text in these documents.

    I've been looking for some kind of simple I/O API to convert some or all of these files into HTML to be read and indexed for my latest project.

    I need members to be able to search within documents I've uploaded in these formats. I see Google Docs has an API, but the documentation is quite limited IMO.

    I don't need a one-API-fits-all-file-types solution, but I do need a solution for all these formats, even if it's one API per file type. I've thought about waiting for Google to cache the document and then accessing the "view in HTML" cached version, but it'd be nice to have it independent of all that nonsense.

    For this project, automation is key! There could potentially be hundreds (or even thousands) of documents a day flowing through the site, and it cannot be done manually.

    Any suggestions?
     
    tguillea, Apr 24, 2010
  2. Brad33

    #2
    Brad33, Apr 24, 2010
  3. tguillea

    #3
    Thanks, but I think I have PDF covered - http://www.setasign.de/products/pdf-php-solutions/fpdi/demos/simple-demo/

    The Office 2003 and Office 2007 formats are the ones I'm most concerned about. I know I can get at the text of an Office 2007 file by unzipping the docx (or xlsx, etc.), but it's tough to strip out all the unnecessary styling/formatting XML.

    Having just realized this, I think I have at least the Office 2007 formats under control with PHP's zip_open()!
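
    The unzip-and-strip idea can be sketched roughly as follows. This is only an illustration under a few assumptions: the zip and dom extensions are enabled, the function names docxXmlToText() and extractDocxText() are hypothetical, and ZipArchive is used in place of zip_open() because the procedural zip functions are deprecated as of PHP 8.0. It keeps only the text runs (w:t) of each paragraph (w:p) in word/document.xml and ignores all styling XML.

```php
<?php
// Hypothetical helper: reduce the XML of word/document.xml to plain text.
// Only the text runs (w:t) inside each paragraph (w:p) are kept.
function docxXmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return ''; // not well-formed XML
    }
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $lines = [];
    foreach ($doc->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $line = '';
        foreach ($paragraph->getElementsByTagNameNS($ns, 't') as $run) {
            $line .= $run->textContent;
        }
        $lines[] = $line;
    }
    return trim(implode("\n", $lines));
}

// Hypothetical helper: open a .docx as a ZIP archive and extract its body text.
function extractDocxText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("word/document.xml not found in $path");
    }
    return docxXmlToText($xml);
}
```

    Calling extractDocxText() on an uploaded file's path would return text ready for indexing. The sketch covers docx only; xlsx and pptx are also ZIP archives but keep their text in different XML parts, so each would need its own extraction step.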

    What file format do the old Office extensions (doc, xls, ppt) use?
     
    tguillea, Apr 25, 2010