Extract text from a PDF file

moove Peon

#1

I am trying to extract the text content from a PDF file so that I can count the words in it, but I couldn't find a way. Can anyone suggest how to do this?

Thanks

moove, Feb 22, 2007

JZY Peon

#2

Use the Text Select Tool (V) in Acrobat Reader to enable text selection, then copy and paste the text into another application such as MS Word, which can count the words.

P.S.
Acrobat Reader 5 and later support encrypted PDF files. With an encrypted file, copy and paste may not work, and your PHP script may not work properly either. To find out whether the file is encrypted, view its source code.
A file that is not encrypted will show:
%PDF-1.4
%âãÏÓ
2958 0 obj <</Linearized 1/L 969089/O 2961/E 175711/N 20/T 909880/H [ 1081 1170]>>
endobj

xref
2958 38
0000000016 00000 n
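
A rough way to run that check from PHP, offered as a sketch rather than a definitive test. The helper name `pdfLooksEncrypted` is made up for illustration. It relies on the PDF convention that an encrypted file carries an `/Encrypt` entry in its trailer, so finding that name in the raw bytes is a strong hint, not proof.

```php
<?php
// Hypothetical helper: heuristic check of raw PDF bytes for encryption.
// Returns null when the data does not start with a PDF header.
function pdfLooksEncrypted($bytes) {
    if (strncmp($bytes, '%PDF-', 5) !== 0) {
        return null; // not a PDF at all
    }
    // Encrypted PDFs point to an encryption dictionary via /Encrypt.
    // The \b keeps names such as /EncryptMetadata from matching.
    return preg_match('~/Encrypt\b~', $bytes) === 1;
}

// Usage, assuming a file named file.pdf exists:
// $state = pdfLooksEncrypted(file_get_contents('file.pdf'));
```

A `true` result suggests the text extraction will need the file decrypted first; `false` suggests the streams can be read directly.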

JZY, Feb 22, 2007

moove Peon

#3

Thanks for replying.

My requirement is to count the words when a user uploads the PDF file.

I tried the following code from php.net:

<?php

$text = pdf2string("file.pdf");
echo $text;

// Extract text from every Flate-compressed stream in a PDF file.
// Returns false if the file cannot be read.
function pdf2string($sourcefile) {
    $content = @file_get_contents($sourcefile);
    if ($content === false) {
        return false;
    }

    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfdocument = '';
    $startpos = 0;

    while (($pos = strpos($content, $searchstart, $startpos)) !== false) {
        // Move past the keyword and the line break (CRLF or LF) that follows it.
        $pos += strlen($searchstart);
        if (substr($content, $pos, 2) === "\r\n") $pos += 2;
        else if (substr($content, $pos, 1) === "\n") $pos++;

        $pos2 = strpos($content, $searchend, $pos);
        if ($pos2 === false) break;

        // Drop the line break that precedes "endstream".
        $end = $pos2;
        if (substr($content, $end - 2, 2) === "\r\n") $end -= 2;
        else if ($content[$end - 1] === "\n") $end--;

        // Streams that are not Flate-compressed (images, fonts, ...) fail
        // to inflate; skip them instead of aborting the whole document.
        $textsection = substr($content, $pos, max(0, $end - $pos));
        $data = @gzuncompress($textsection);
        if ($data !== false) {
            $pdfdocument .= ExtractText2($data);
        }

        $startpos = $pos2 + strlen($searchend);
    }
    return $pdfdocument;
}

// Collect the string operands of Tj operators, e.g. "(Hello) Tj",
// from a decoded content stream.
function ExtractText2($postScriptData) {
    $text = '';
    $textStart = 0;

    while (($ini = strpos($postScriptData, '(', $textStart)) !== false) {
        $end = strpos($postScriptData, ')', $ini + 1);
        if ($end === false) break;

        // Keep the string only when "Tj" follows the closing parenthesis,
        // either directly or after one separator character.
        $after = substr($postScriptData, $end + 1, 3);
        if (strpos($after, 'Tj') !== false) {
            $text .= substr($postScriptData, $ini + 1, $end - $ini - 1);
        }
        $textStart = $end + 1;
    }

    // Map octal escapes for accented letters to plain letters and
    // drop the escapes for curly quotes.
    $trans = array("\\341" => "a", "\\351" => "e", "\\355" => "i", "\\363" => "o", "\\223" => "", "\\224" => "");
    return strtr($text, $trans);
}
?>

But it shows an error in the "gzuncompress" function.

I am using XAMPP on a Windows server.

Please let me know if there is a solution for this.
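
One thing worth ruling out, purely as a guess since the exact error message is not quoted: `gzuncompress()` is provided by PHP's zlib extension, and if that extension is not loaded the call fails as an undefined function. The sketch below reports whether the function is available and does a round trip on a test string. `zlibSelfTest` is a hypothetical name.

```php
<?php
// Hypothetical self-test: is zlib available, and does a round trip work?
function zlibSelfTest() {
    if (!extension_loaded('zlib') || !function_exists('gzuncompress')) {
        return 'zlib extension is not loaded';
    }
    $sample = 'BT (Hello) Tj ET';
    $packed = gzcompress($sample);
    return gzuncompress($packed) === $sample
        ? 'zlib works'
        : 'zlib is loaded but the round trip failed';
}

echo zlibSelfTest(), "\n";
```

If this reports that zlib works, the failure is more likely in the data itself: not every stream in a PDF is Flate-compressed, and `gzuncompress()` returns false for streams in other formats.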

moove, Feb 23, 2007
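
For the counting step itself, once some text has been extracted, a minimal sketch could look like this. `countWords` is a hypothetical helper, and treating any run of whitespace as a word separator is an assumption about what counts as a word.

```php
<?php
// Hypothetical helper: count whitespace-separated words in extracted text.
function countWords($text) {
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

echo countWords("Hello  PDF\nworld"), "\n"; // prints 3
```

In an upload handler, the text would come from running the extraction function on the uploaded file's temporary path (the `tmp_name` entry of `$_FILES`).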
