text extraction

Discussion in 'Programming' started by ichkoguy, Jan 14, 2009.

  1. #1
    Hi guys,

    I am currently working on a text extraction project. I have written a program that downloads the source code of a web page, but I do not know how to retrieve just the text without the HTML tags. Is there a solution for this? I need some urgent help, guys. You can also contact me via email.

    I am doing it in Java. I know there is a method in C# that strips the HTML tags and keeps only the text, but I do not know how to work with C#. My front end is going to be written in PHP. Right now I have a Java program that downloads the source code of a web page, so I need to proceed from there.
     
    ichkoguy, Jan 14, 2009 IP
  2. gnp

    #2
    Take a look at this
    http://www.rgagnon.com/javadetails/java-0424.html
     
    gnp, Jan 15, 2009 IP
  3. ichkoguy

    #3

    Wow. It's great, dude. Thank you so much. I have one question: can this same code be rewritten in JavaScript? Because Java is not compatible with PHP.
     
    ichkoguy, Jan 15, 2009 IP
  4. gnp

    #4
    Oh,
    you mentioned Java, which is why I pointed you there.

    If you meant JavaScript:
    
    // Remove anything that looks like a tag: "<", a non-space character, then everything up to ">".
    function stripHTML( text )
    {
    	var re = /<\S[^><]*>/g;
    	return text.replace(re, "");
    }
    Code (markup):
    altered from http://www.javascriptkit.com/script/script2/removehtml.shtml
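
    A quick usage sketch for the function above, run on the same sample string as the PHP example. The variable names are only for illustration, and the function is repeated so the snippet runs on its own. Note that the regex removes tags only: HTML entities such as &amp; are left as they are, and a "<" followed by a space is not treated as a tag.

```javascript
// Same regex as in the post, repeated here so the snippet is self-contained.
function stripHTML(text) {
  var re = /<\S[^><]*>/g;
  return text.replace(re, "");
}

// Illustrative input (hypothetical variable name).
var sample = '<p>Test paragraph.</p> <a href="#fragment">Other text</a>';
console.log(stripHTML(sample)); // prints "Test paragraph. Other text"

// A "<" followed by whitespace does not match, so plain comparisons survive.
console.log(stripHTML("1 < 2")); // prints "1 < 2"

// Entities are not decoded; that would need a separate step.
console.log(stripHTML("<b>Fish &amp; chips</b>")); // prints "Fish &amp; chips"
```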


    Or, if you prefer, in PHP:
    
    <?php
    $text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
    echo strip_tags($text);
    echo "\n";
    
    // Allow <p> and <a>
    echo strip_tags($text, '<p><a>');
    ?>
    
    PHP:
    quoted from http://gr2.php.net/strip-tags


    hope this helps
     
    gnp, Jan 15, 2009 IP
  5. ichkoguy

    #5
    Hello all,

    I am back with a problem in one of the programs.

    The Java code works fine for removing the HTML tags.

    But I have a problem when I try to extract the contents of a web page that is developed in a language other than HTML.

    http://bottle-opener.info/?p=9

    In my output file I get unnecessary data, such as parts of the code like ## or whatever else it may be.

    For example, when I specify Google, which is built on AJAX, the code is entirely different and unwanted parts show up in my output.

    My intention is to extract only the text data. Is there any way to do that?
     
    ichkoguy, Feb 16, 2009 IP
  6. gnp

    #6
    I am not sure I understand what you mean...

    No matter what language is used to develop a web page, it always ends up being served to the client as HTML.

    So the final web page you have access to is always HTML, because that is what browsers understand.

    If you are talking about JavaScript embedded in a page, it too is contained inside script tags (<script></script>), so it is also treated as part of the HTML.
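
    One thing to watch for: a tag-only regex like stripHTML in post #4 removes the <script> and </script> tags themselves but leaves the JavaScript between them in the output, which may be the unwanted "coding part". Here is a rough sketch that drops script and style blocks (and comments) before stripping the remaining tags. The name extractText is made up for this example, and this is a regex approximation rather than a real HTML parser, so unusual markup can still slip through.

```javascript
// Hypothetical helper: regex-based, not a full HTML parser.
function extractText(html) {
  return html
    // Drop <script> and <style> blocks together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Strip the remaining tags (same pattern as stripHTML in post #4).
    .replace(/<\S[^><]*>/g, " ")
    // Collapse runs of whitespace left behind by the removals.
    .replace(/\s+/g, " ")
    .trim();
}

// Illustrative page source (made up for this example).
var page =
  '<html><head><style>p { color: red; }</style>' +
  '<script type="text/javascript">var x = "<b>";</script></head>' +
  '<body><p>Hello</p><!-- note --><p>world</p></body></html>';

console.log(extractText(page)); // prints "Hello world"
```

    Tags are replaced with a space here so that text from neighbouring paragraphs does not run together; the trade-off is that inline tags in the middle of a word will also produce a space.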

    regards
     
    gnp, Feb 16, 2009 IP