PDF parser

rfeio Peon

#1

Hi,

My site has several PDF documents that users can download. I would like to index the content of those PDF files so that users can search for a given term and the site returns the PDF files that are relevant.

I was thinking that the best way to do this might be to parse the content of the PDF files and save it in a MySQL table. When a user runs a search, the script would look in the table and return the names of the PDF files relevant to the search.

I need some guidance on how to parse a PDF file, since I've never done this before. Also, is this the best way of achieving what I want?

Thanks!

Rfeio

rfeio, Apr 2, 2009 IP

SmallPotatoes Peon

#2

You definitely have the right idea - searching will be much faster if you extract the text in advance to somewhere it can be indexed (e.g. MySQL FULLTEXT index).
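
As a rough sketch of what the MySQL side could look like: the table name pdf_text, its columns, and both helper functions are made up for illustration, and the cursor is assumed to come from a Python DB-API MySQL driver that uses %s placeholders (PyMySQL and mysql-connector-python both do). Treat it as a starting point, not a finished schema.

```python
# Hypothetical schema: one row per PDF, with a FULLTEXT index on the text.
# Older MySQL versions only support FULLTEXT on MyISAM tables; InnoDB
# gained it in MySQL 5.6.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS pdf_text (
    id INT AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(255) NOT NULL,
    body MEDIUMTEXT NOT NULL,
    FULLTEXT (body)
)
"""

INSERT_SQL = "INSERT INTO pdf_text (filename, body) VALUES (%s, %s)"

# MATCH ... AGAINST appears twice: once to compute the relevance score,
# once to filter out rows that do not match at all.
SEARCH_SQL = """
SELECT filename, MATCH(body) AGAINST (%s) AS score
FROM pdf_text
WHERE MATCH(body) AGAINST (%s)
ORDER BY score DESC
"""


def store_pdf_text(cursor, filename, body):
    """Save the extracted text of one PDF (hypothetical helper)."""
    cursor.execute(INSERT_SQL, (filename, body))


def search_pdfs(cursor, query):
    """Return matching PDF file names, most relevant first."""
    cursor.execute(SEARCH_SQL, (query, query))
    return [row[0] for row in cursor.fetchall()]
```

The search term is passed as a bound parameter rather than pasted into the SQL string, so user input cannot alter the query.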

Use pdftotext, which is part of the xpdf distribution. PDF documents are quite complex, so there's little point in reinventing the wheel.

http://www.foolabs.com/xpdf/about.html
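
A minimal sketch of calling pdftotext from a script, assuming the binary is installed and on the PATH. The function names are invented for illustration, and Python is used here only as an example language; the same command can be run from any language that can start a process.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext argument list for one PDF file."""
    # "-enc UTF-8" requests UTF-8 output; giving "-" as the text file
    # tells pdftotext to write the text to stdout instead of a file.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path, run=subprocess.run):
    """Return the text of one PDF as a string.

    `run` defaults to subprocess.run and is a parameter only so that it
    can be replaced with a stub when testing without pdftotext installed.
    """
    # check=True raises CalledProcessError if pdftotext exits with an error.
    result = run(pdftotext_command(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") (a placeholder file name) would return the document text, ready to be stored wherever the search index lives. Scanned PDFs that contain only images will come back empty or nearly so, since pdftotext does not perform OCR.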

SmallPotatoes, Apr 2, 2009 IP

fourfingers Peon

#3

Use a Google search to find the document:
"search query" site:mysite.com filetype:pdf
That assumes your PDFs are indexed by Google (likely). If not, you might have to do something different, like using Sphider to build a custom search engine for your site.

fourfingers, Apr 5, 2009 IP
