Hi guys, I am currently doing a project on text extraction. I have written a program that downloads the source code of a web page, but I do not know how to retrieve just the text without the HTML tags. Is there any solution for this? I need some urgent help, guys. You can also contact me via email. I am doing it in Java. I know there is a method in C# that strips the HTML tags and keeps only the text, but I do not know how to work with C#. My front end is going to be designed in PHP. Right now I have the Java program that downloads the source code of a web page, so I need to proceed from there.
Wow, it's great, dude. Thank you so much. I have a question: can this same code be rewritten in JavaScript? I ask because Java is not compatible with PHP.
Oh, you mentioned Java, and that is why I pointed you there. You meant JavaScript:

function stripHTML(text) {
    // Match anything that looks like a tag: "<", a non-space character, then everything up to ">"
    var re = /<\S[^><]*>/g;
    return text.replace(re, "");
}

(altered from http://www.javascriptkit.com/script/script2/removehtml.shtml)

Or, if you prefer, in PHP:

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

(quoted from http://gr2.php.net/strip-tags)

Hope this helps.
Hello all, I am back with a problem in one of the programs. The Java code works fine for eliminating the HTML tags themselves, but I have a problem when I try to extract the contents of a web page developed in any language other than HTML, for example http://bottle-opener.info/?p=9. In my output file I get unnecessary data, such as pieces of code like ## or whatever else it may be. For example, when I specify Google, which is built on Ajax, the code is entirely different and unwanted parts show up in my output. My intention is to extract only the text data. Is there any way to do that?
I am not sure I understand what you are saying. No matter what language is used to develop a web page, it always ends up being served to the client as HTML. So the final page you have access to is always HTML, because that is what browsers understand. If you are talking about JavaScript embedded in a page, that code is also contained inside a script tag (<script></script>), so it is part of the HTML as well. Regards
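Since embedded scripts sit between <script> and </script>, a stripper that removes only the tags leaves the script text behind, which may be the unwanted code showing up in the output. Below is a minimal sketch of one way to handle that, assuming regular expressions are good enough for the pages involved (they are not a real HTML parser). The function name extractText and the sample page are made up for illustration; the last tag pattern is the one from the stripHTML function posted earlier in the thread.

```javascript
// Hypothetical helper, not from the thread: removes script/style blocks
// together with their contents, then strips the remaining tags.
function extractText(html) {
  return html
    // Drop <script>...</script> and <style>...</style> including their bodies.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Drop whatever tags are left (same pattern as stripHTML).
    .replace(/<\S[^><]*>/g, " ")
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, " ")
    .trim();
}

// Made-up sample page for illustration.
const page =
  "<html><head><script>var x = 1;</script><style>p { color: red; }</style></head>" +
  "<body><p>Hello <b>world</b></p><!-- note --></body></html>";

console.log(extractText(page)); // prints "Hello world"
```

Tags are replaced with a space rather than an empty string so that words from neighbouring elements do not run together. The same idea should carry over to Java or PHP by removing the script and style blocks before stripping the tags.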