PDF parser

Discussion in 'PHP' started by rfeio, Apr 2, 2009.

  1. #1
    Hi,

    My site has several PDF documents that users can download. I would like to index the content of those PDF files, so that users can search for a given term and the site returns the PDF files that are relevant.

    I was thinking that the best way of doing this might be to parse the content of the PDF files and save it in a MySQL table. When a user runs a search, the script would look in the table and return the names of the PDF files relevant to the search.

    I would need some guidance on how to parse a PDF file, since I've never done this before. Also, is this the best way of achieving what I want?

    Thanks!

    Rfeio
     
    rfeio, Apr 2, 2009
  2. SmallPotatoes

    #2
    You definitely have the right idea - searching will be much faster if you extract the text in advance and store it somewhere it can be indexed (e.g. a MySQL table with a FULLTEXT index).
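
As a rough sketch of the FULLTEXT idea (not a definitive implementation): the table name pdf_index, its columns, and the function names below are all made up for illustration, and the code assumes a PDO connection to MySQL. FULLTEXT indexes need MyISAM, or InnoDB on MySQL 5.6 and later.

```php
<?php
// Hypothetical schema: one row per PDF, with the extracted text in a
// FULLTEXT-indexed column. Table and column names are illustrative only.
const CREATE_PDF_INDEX_SQL = <<<'SQL'
CREATE TABLE IF NOT EXISTS pdf_index (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(255) NOT NULL UNIQUE,
    content MEDIUMTEXT NOT NULL,
    FULLTEXT KEY ft_content (content)
)
SQL;

// Build the search query. Two separate placeholders are used because
// native prepared statements do not allow reusing one named placeholder.
function build_search_sql(int $limit): string
{
    return 'SELECT filename, MATCH(content) AGAINST (:terms_score) AS score '
         . 'FROM pdf_index '
         . 'WHERE MATCH(content) AGAINST (:terms_where) '
         . 'ORDER BY score DESC '
         . 'LIMIT ' . max(1, $limit);
}

// Insert or overwrite the stored text for one PDF (REPLACE is MySQL-specific).
function store_pdf_text(PDO $db, string $filename, string $text): void
{
    $stmt = $db->prepare(
        'REPLACE INTO pdf_index (filename, content) VALUES (:filename, :content)'
    );
    $stmt->execute([':filename' => $filename, ':content' => $text]);
}

// Return matching rows (filename and relevance score), best match first.
function search_pdfs(PDO $db, string $terms, int $limit = 20): array
{
    $stmt = $db->prepare(build_search_sql($limit));
    $stmt->execute([':terms_score' => $terms, ':terms_where' => $terms]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

You would run CREATE_PDF_INDEX_SQL once, call store_pdf_text() whenever a PDF is uploaded or changed, and call search_pdfs() from the search page. Keep in mind that MySQL's natural-language mode ignores very short words and stopwords by default, so some queries may return nothing.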

    Use pdftotext, which is part of the xpdf distribution. PDF documents are quite complex, so there's little point in reinventing the wheel.

    http://www.foolabs.com/xpdf/about.html
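
Calling pdftotext from PHP could look something like the sketch below. It assumes pdftotext is installed and on the PATH, that exec() is not disabled on your host, and that the server is Unix-like (because of the 2>/dev/null redirect). The function names are my own, not part of xpdf or PHP.

```php
<?php
// Build the shell command. Passing "-" as the output file makes
// pdftotext write the text to stdout instead of a .txt file.
function build_pdftotext_command(string $pdfPath): string
{
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Return the extracted text, or null if pdftotext failed
// (missing file, unreadable PDF, or pdftotext not installed).
function extract_pdf_text(string $pdfPath): ?string
{
    $output = [];
    $status = 0;
    exec(build_pdftotext_command($pdfPath) . ' 2>/dev/null', $output, $status);
    if ($status !== 0) {
        return null;
    }
    // Collapse runs of whitespace (including page-break form feeds)
    // so the stored text is compact for indexing.
    return trim(preg_replace('/\s+/', ' ', implode("\n", $output)));
}
```

Scanned PDFs that contain only images will come back empty or nearly empty, since pdftotext does not do OCR.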
     
    SmallPotatoes, Apr 2, 2009
  3. fourfingers

    #3
    Use a Google search operator to find the document:

    "search query" site:mysite.com filetype:pdf

    That assumes your PDFs are indexed by Google (likely). If not, you might have to do something different, such as using Sphider to build a custom search engine for your site.
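
If you go the Google route, a search box on your site could simply redirect to a query URL built like this. This is only a sketch: the function name is hypothetical, mysite.com is a placeholder, and it uses Google's site: operator to restrict results to one domain.

```php
<?php
// Build a Google search URL limited to PDF files on one domain.
// Double quotes in the user's input are dropped so the phrase stays quoted.
function google_pdf_search_url(string $query, string $domain): string
{
    $phrase = str_replace('"', '', $query);
    $q = '"' . $phrase . '" site:' . $domain . ' filetype:pdf';
    return 'https://www.google.com/search?q=' . urlencode($q);
}
```

For example, header('Location: ' . google_pdf_search_url($_GET['q'], 'mysite.com')); would send the visitor to the results page.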
     
    fourfingers, Apr 5, 2009