I am trying to extract the text content from a PDF file so that I can count the words in it, but I couldn't find a way to do this. Can anyone give me some suggestions? Thanks.
Use the Text Select Tool (V) in Acrobat Reader to enable text selection, then copy and paste the text into another application such as MS Word, which can count the words for you.

P.S. Since Acrobat Reader 5, encrypted PDF files are supported. For those files copy and paste may not work, and your PHP script may not work properly with them either. To find out whether a file is encrypted, view its source. A file that is not encrypted will show something like this:

%PDF-1.4
%âãÏÓ
2958 0 obj <</Linearized 1/L 969089/O 2961/E 175711/N 20/T 909880/H [ 1081 1170]>> endobj
xref
2958 38
0000000016 00000 n
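A rough way to automate that check from PHP, offered as a sketch rather than a definitive test: an encrypted PDF normally references an /Encrypt dictionary from its trailer, so searching the raw bytes for that name is a usable heuristic. The helper name isProbablyEncrypted() is made up for this example, and the sketch assumes the %PDF- header sits at the very start of the file.

```php
<?php
// Heuristic sketch: isProbablyEncrypted() is a hypothetical helper,
// not part of PHP or of any PDF library.
function isProbablyEncrypted(string $pdfBytes): bool
{
    // Assumption: the "%PDF-" header is at byte 0 of the file.
    if (strncmp($pdfBytes, '%PDF-', 5) !== 0) {
        throw new InvalidArgumentException('Not a PDF file');
    }
    // Encrypted files carry an /Encrypt entry in the trailer dictionary.
    // A plain substring search can give false positives if that name
    // happens to appear elsewhere in the file.
    return strpos($pdfBytes, '/Encrypt') !== false;
}
```

It could be called as isProbablyEncrypted(file_get_contents("file.pdf")) right after the upload, before any attempt to extract text.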
Thanks for replying. My requirement is to count the words while the user is uploading the PDF file. I tried the following code from php.net:

<?php
$text = pdf2string("file.pdf");
echo $text;

// Collect the text found in every Flate-compressed stream of the PDF.
function pdf2string($sourcefile) {
    $content = file_get_contents($sourcefile);
    if ($content === false) {
        return -1; // the file could not be read
    }
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfdocument = '';
    $startpos = 0;
    while (($pos = strpos($content, $searchstart, $startpos)) !== false) {
        $pos += strlen($searchstart);
        $pos2 = strpos($content, $searchend, $pos);
        if ($pos2 === false) {
            break;
        }
        // Skip the end-of-line marker (CRLF or LF) that follows "stream".
        if (substr($content, $pos, 2) === "\r\n") $pos += 2;
        elseif (substr($content, $pos, 1) === "\n") $pos++;
        // Drop the end-of-line marker that precedes "endstream".
        $end = $pos2;
        if (substr($content, $end - 2, 2) === "\r\n") $end -= 2;
        elseif (substr($content, $end - 1, 1) === "\n") $end--;
        $startpos = $pos2 + strlen($searchend);
        if ($end <= $pos) {
            continue; // empty stream
        }
        $data = @gzuncompress(substr($content, $pos, $end - $pos));
        if ($data === false) {
            continue; // not a Flate-compressed stream (image, font, ...)
        }
        $pdfdocument .= ExtractText2($data);
    }
    return $pdfdocument;
}

// Pull the "(...) Tj" string operands out of a decompressed content stream.
function ExtractText2($postScriptData) {
    $text = '';
    $textStart = 0;
    $len = strlen($postScriptData);
    while ($textStart < $len) {
        $ini = strpos($postScriptData, '(', $textStart);
        if ($ini === false) break;
        $end = strpos($postScriptData, ')', $ini + 1);
        if ($end === false) break;
        // Keep the string only when it is directly followed by the Tj operator.
        $valtext = strpos($postScriptData, 'Tj', $end + 1);
        if ($valtext === $end + 2) {
            $text .= substr($postScriptData, $ini + 1, $end - $ini - 1);
        }
        $textStart = $end + 1;
    }
    // Replace a few octal escapes for accented letters and drop curly quotes.
    $trans = array("\\341" => "a", "\\351" => "e", "\\355" => "i", "\\363" => "o", "\\223" => "", "\\224" => "");
    return strtr($text, $trans);
}
?>

but it shows an error in the gzuncompress function. I am using XAMPP on a Windows server. Is there any solution for that?
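Two small pieces that might help here, as a sketch under assumptions rather than a confirmed fix. gzuncompress() is only defined when PHP's zlib extension is loaded, so checking for it first tells you whether the error comes from the setup or from the PDF data. Once pdf2string returns a string, the words can be counted by splitting on whitespace. The names zlibAvailable() and countWords() are made up for this example.

```php
<?php
// Hypothetical helper: reports whether gzuncompress() can be called at all.
// It is provided by PHP's zlib extension.
function zlibAvailable(): bool
{
    return function_exists('gzuncompress');
}

// Hypothetical helper: counts words by splitting on runs of whitespace
// and ignoring empty pieces.
function countWords(string $text): int
{
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? 0 : count($words);
}
```

In the upload handler this could look like: if zlibAvailable() returns true, call countWords(pdf2string($uploadedPath)), where $uploadedPath is a placeholder for wherever the uploaded file was stored. Splitting on whitespace is a deliberate simplification; text extracted from a PDF often comes out in fragments, so the count should be treated as approximate.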