Extract text from a PDF file

Discussion in 'PHP' started by moove, Feb 22, 2007.

  1. #1
    I am trying to extract the text content from a PDF file so that I can count the words in it, but I couldn't find a way. Can anyone give me some suggestions on how to do this?

    Thanks
     
    moove, Feb 22, 2007 IP
  2. JZY

    #2
    Use the Text Select Tool (V) in Acrobat Reader to enable text selection, then copy and paste the text into another application such as MS Word, which can count the words.

    P.S.
    Since Acrobat Reader 5, encrypted PDF files are supported. With those, copy and paste may not work, and
    your PHP script may not work correctly with encrypted files either. To find out whether a file is encrypted, view its source.
    A file that is not encrypted will show something like:
    %PDF-1.4
    %âãÏÓ
    2958 0 obj <</Linearized 1/L 969089/O 2961/E 175711/N 20/T 909880/H [ 1081 1170]>>
    endobj

    xref
    2958 38
    0000000016 00000 n
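
    The same check can be done from PHP instead of by eye. This is a minimal sketch, assuming PHP 7 or later and a file small enough to read into memory; the helper name pdfLooksEncrypted is hypothetical. It relies on the fact that an encrypted PDF references an encryption dictionary through an /Encrypt entry in its trailer. Because it searches the raw bytes instead of parsing the file, it is only a heuristic.

```php
<?php
// Hypothetical helper: heuristic check on the raw bytes of a PDF.
function pdfLooksEncrypted(string $bytes): bool
{
    // A PDF normally starts with a "%PDF-" header; reject anything else.
    if (strncmp($bytes, '%PDF-', 5) !== 0) {
        throw new InvalidArgumentException('Not a PDF file');
    }
    // Encrypted documents carry an /Encrypt entry in the trailer.
    return strpos($bytes, '/Encrypt') !== false;
}

// Usage (the file name is only an example):
// var_dump(pdfLooksEncrypted(file_get_contents('file.pdf')));
```

    A document whose own content happens to contain the text "/Encrypt" would be reported as encrypted, so treat a positive result as a hint to investigate, not as proof.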
     
    JZY, Feb 22, 2007 IP
  3. moove

    #3
    Thanks for replying.

    My requirement is to count the words when a user uploads a PDF file.
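
    Once the text has been extracted, the counting step itself is short. This is a minimal sketch, assuming PHP 7 or later and that a "word" means any run of non-whitespace characters; the helper name countWords and the upload field name in the usage comment are hypothetical.

```php
<?php
// Hypothetical helper: counts whitespace-separated words in extracted text.
function countWords(string $text): int
{
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

// Usage with an upload (the field name 'pdf' is an assumption):
// $text = pdf2string($_FILES['pdf']['tmp_name']);
// echo countWords($text);
```

    Splitting on whitespace avoids the locale-dependent behaviour of str_word_count(), but it also counts stand-alone numbers and punctuation as words.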

    I tried the following code from php.net:


    <?php

    $text = pdf2string("file.pdf");
    echo $text;

    // Reads a PDF and concatenates the text found in its Flate-compressed streams.
    // Returns -1 if the file cannot be read.
    function pdf2string($sourcefile){
        $content = file_get_contents($sourcefile);
        if ($content === false) return -1;

        $searchstart = 'stream';
        $searchend = 'endstream';
        $pdfdocument = '';
        $startpos = 0;

        while (true){
            // Find the next "stream" keyword and the "endstream" that closes it.
            $pos = strpos($content, $searchstart, $startpos);
            if ($pos === false) break;
            $pos2 = strpos($content, $searchend, $pos + strlen($searchstart));
            if ($pos2 === false) break;

            // The data begins after the line break that follows "stream".
            $datastart = $pos + strlen($searchstart);
            if (substr($content, $datastart, 2) === "\r\n") $datastart += 2;
            else if (substr($content, $datastart, 1) === "\n") $datastart++;

            // Drop the line break that precedes "endstream".
            $dataend = $pos2;
            if (substr($content, $dataend - 2, 2) === "\r\n") $dataend -= 2;
            else if (in_array(substr($content, $dataend - 1, 1), array("\n", "\r"), true)) $dataend--;

            $textsection = substr($content, $datastart, $dataend - $datastart);
            $data = @gzuncompress($textsection);
            // Continue after this "endstream" so its "stream" suffix is not matched again.
            $startpos = $pos2 + strlen($searchend);

            // Streams that are not Flate-compressed (images, fonts) fail to inflate; skip them.
            if ($data === false) continue;

            $pdfdocument .= ExtractText2($data);
        }
        return $pdfdocument;
    }

    // Collects the strings shown with the Tj operator, e.g. "(Hello) Tj".
    function ExtractText2($postScriptData){
        $text = '';
        $textStart = 0;
        $len = strlen($postScriptData);

        while ($textStart < $len){
            $ini = strpos($postScriptData, '(', $textStart);
            if ($ini === false) break;
            $end = strpos($postScriptData, ')', $ini + 1);
            if ($end === false) break;

            // Keep the string only if "Tj" follows the closing parenthesis after one space.
            $valtext = strpos($postScriptData, 'Tj', $end + 1);
            if ($valtext === $end + 2)
                $text .= substr($postScriptData, $ini + 1, $end - $ini - 1);

            $textStart = $end + 1;
        }

        // Replace a few octal escapes left in the stream with plain characters.
        $trans = array("\\341" => "a","\\351" => "e","\\355" => "i","\\363" => "o","\\223" => "","\\224" => "");
        return strtr($text, $trans);
    }
    ?>


    But it shows an error in the "gzuncompress" function.

    I am using XAMPP on a Windows server.

    Please let me know if there is a solution for this.
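
    The error text is not quoted here, so the cause can only be guessed. One way to narrow it down is to check whether zlib is available at all, and to capture the warning that the @ operator hides. This is a minimal sketch, assuming PHP 7.1 or later; the helper names zlibProblem and inflateOrExplain are hypothetical.

```php
<?php
// Returns a reason why gzuncompress() cannot be used, or null if it can.
function zlibProblem(): ?string
{
    if (!extension_loaded('zlib')) {
        return 'The zlib extension is not loaded in this PHP build.';
    }
    if (!function_exists('gzuncompress')) {
        return 'gzuncompress() is unavailable (possibly listed in disable_functions).';
    }
    return null;
}

// Inflates one stream and returns [data, warning] instead of hiding the warning with @.
function inflateOrExplain(string $bytes): array
{
    $warning = null;
    set_error_handler(function ($no, $msg) use (&$warning) {
        $warning = $msg; // remember the message PHP would have printed
        return true;     // mark the warning as handled
    });
    $data = gzuncompress($bytes);
    restore_error_handler();
    return [$data, $warning];
}

// Usage:
// echo zlibProblem() ?? 'zlib looks fine';
```

    If zlibProblem() returns null but inflateOrExplain() reports a data error for every stream, the streams are probably not plain Flate data, for example because the file is encrypted or uses a different filter.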
     
    moove, Feb 23, 2007 IP