Hi, my site has several PDF documents that users can download. However, I would like to index the content of those PDF files, so that users could search for a given term and the site would return which PDF files are relevant. I was thinking the best way of doing this might be to parse the content of the PDF files and save it in a MySQL table. When the user does a search, the script would look in the table and return the names of the PDF files relevant to the search. I would need some guidance on how to parse a PDF file, since I've never done this before. Also, would this be the best way of achieving what I want? Thanks! Rfeio
You definitely have the right idea: searching will be much faster if you extract the text in advance into somewhere it can be indexed (e.g. a MySQL FULLTEXT index). Use pdftotext, which is part of the xpdf distribution. PDF documents are quite complex, so there's little point in reinventing the wheel. http://www.foolabs.com/xpdf/about.html
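Roughly, the pipeline would look like this. This is only a sketch: the pdftotext invocation, paths, and the table/column names (`pdf_docs`, `filename`, `body`) are my own invention, not anything standard.

```shell
# A minimal sketch of the indexing pipeline, assuming pdftotext is on PATH;
# the directory, table, and column names here are made up for illustration.

# Step 1 - extraction: passing "-" as the output file makes pdftotext
# write the plain text to stdout, e.g.:
#   pdftotext /var/www/pdfs/manual.pdf -

# Step 2 - storage and search: a MySQL table with a FULLTEXT index over
# the extracted text. (Older MySQL only supports FULLTEXT on MyISAM
# tables; InnoDB gained FULLTEXT support in MySQL 5.6.)
sql=$(cat <<'SQL'
CREATE TABLE pdf_docs (
  id       INT AUTO_INCREMENT PRIMARY KEY,
  filename VARCHAR(255) NOT NULL,
  body     MEDIUMTEXT NOT NULL,
  FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- Return the file names most relevant to the user's search terms
SELECT filename
FROM pdf_docs
WHERE MATCH (body) AGAINST ('search terms');
SQL
)
printf '%s\n' "$sql"
```

Your PHP script would run pdftotext on each upload (e.g. via `shell_exec`), INSERT the filename and extracted text into the table, and then answer searches with the MATCH ... AGAINST query above.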
Use a Google hack to find the documents: "search query" site:mysite.com filetype:pdf (note the operator is site:, not domain:). That's assuming your PDFs are indexed in Google (likely); if not, you might have to do something different, like using Sphider to build a custom search engine for your site.