Text extraction PHP script and functions

Discussion in 'Programming' started by ichkoguy, Jan 19, 2009.

  1. #1
    My project scenario is this: I am downloading the contents of a web page, so my output will be the source code of the specified page.

    My next step is to retrieve only the text from that source. I thought of using strip_tags(), as it will remove all the tags and leave only the text.

    But I have a doubt here. To strip the tags, we normally store the source in a variable and then pass that variable to the function.

    I want to pass the output of my download straight to strip_tags() and get only the extracted text as the output in my window. How should I proceed? Can you please help me?

    Also, I already have Java code to download the source code of a web page; I need PHP code that does the same.

    Thanks.
     
    ichkoguy, Jan 19, 2009 IP
  2. gnp
    #2
    You can take a look at these pages (about the fopen command):

    http://gr2.php.net/fopen
    and
    http://www.php.net/manual/en/features.remote-files.php
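
A minimal sketch of the fopen approach, not a definitive implementation. It assumes allow_url_fopen is enabled in php.ini when the location is a remote URL; `read_stream` is a hypothetical helper name, and the URL in the usage note is a placeholder.

```php
<?php
// Hypothetical helper: read everything from a file path or URL via fopen().
// Remote URLs only work when allow_url_fopen is enabled in php.ini.
function read_stream($location)
{
    $handle = @fopen($location, 'r');
    if ($handle === false) {
        return false; // could not open the file or URL
    }

    $contents = '';
    while (!feof($handle)) {
        $chunk = fread($handle, 8192); // read up to 8 KB at a time
        if ($chunk === false) {
            break; // read error: stop and return what we have so far
        }
        $contents .= $chunk;
    }
    fclose($handle);

    return $contents;
}
```

It could be called as `$html = read_stream('http://www.example.com/');`, after which `$html` holds the page source, or false if the page could not be opened.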

    take care
     
    gnp, Jan 19, 2009 IP
  3. thegetpr
    #3
    You can use cURL to fetch anything from anywhere.
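
A rough sketch of what a cURL download might look like, assuming the PHP cURL extension is installed. `fetch_with_curl` is a hypothetical helper name, and the redirect and timeout settings are illustrative choices rather than requirements.

```php
<?php
// Hypothetical helper: download a URL with the cURL extension.
// Returns the response body as a string, or false on failure.
function fetch_with_curl($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds

    // The handle is released automatically when $ch goes out of scope.
    return curl_exec($ch);
}
```

Because the body comes back as a plain string, it can be checked against false and then handed directly to strip_tags().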
     
    thegetpr, Jan 19, 2009 IP
  4. gnp
    #4
    Alternatively, to get the contents directly in one go, look at
    http://gr2.php.net/manual/en/function.file-get-contents.php
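
A minimal sketch of how file_get_contents() and strip_tags() might be combined to go from a URL to plain text in one step. `extract_text` is a hypothetical helper name; the entity decoding and whitespace cleanup are optional extras assumed here for readability, not something the original question asked for.

```php
<?php
// Hypothetical helper: fetch a page (or local file) and return only its text.
function extract_text($location)
{
    // file_get_contents() returns the whole document as one string,
    // or false on failure (remote URLs need allow_url_fopen enabled).
    $html = @file_get_contents($location);
    if ($html === false) {
        return false;
    }

    // strip_tags() accepts any string, so the download can be passed
    // to it directly without saving it anywhere first.
    $text = strip_tags($html);

    // Decode entities such as &amp; and collapse runs of whitespace.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

Note that strip_tags() removes only the tags themselves, so the contents of script and style elements would still appear in the result. For a quick one-off, `echo strip_tags(file_get_contents($url));` does the same thing without the cleanup or error handling.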
     
    gnp, Jan 19, 2009 IP
  5. NinjaWork
    #5
    I saw this great script that downloads pages and extracts the text:

    http://ubuntuforums.org/showpost.php?p=4782850&postcount=880

    It is in Perl, but all you really need is the "wget" and "lynx --dump" combo ;-)

    See PHP's system command.
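
A hedged sketch of the shell-out idea, assuming lynx is installed and on the PATH of a Unix-like server. It uses shell_exec() rather than system(), because system() prints the command output directly instead of returning all of it. Both helper names are hypothetical, and lynx's documented single-dash form `-dump` is used here, with `-nolist` added to leave out the list of links.

```php
<?php
// Hypothetical helper: build the shell command that asks lynx to render a
// page as plain text. escapeshellarg() quotes the URL so that it cannot
// inject extra shell commands.
function build_lynx_command($url)
{
    return 'lynx -dump -nolist ' . escapeshellarg($url);
}

// Hypothetical helper: run the command and return the rendered text.
// shell_exec() gives null or false when the command fails or prints nothing.
function lynx_dump($url)
{
    return shell_exec(build_lynx_command($url));
}
```

Since lynx fetches the page itself, a separate wget step is not needed in this variant.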
     
    NinjaWork, Jan 19, 2009 IP