How can I extract all text in an HTML page between the <body> and </body> tags?

ramysarwat Peon

Messages: 164

Likes Received: 0

Best Answers: 0

Trophy Points: 0

#1

How can I extract all text in an HTML page between the <body> and </body> tags?

ramysarwat, Nov 5, 2009 IP

nico_swd Prominent Member

Messages: 4,153

Likes Received: 344

Best Answers: 18

Trophy Points: 375

#2


if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body))
{
    echo $body[1];
}

PHP:
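To see what the pattern does without fetching anything, here is a minimal sketch run against a made-up page. The helper name `body_inner` and the sample markup are assumptions for illustration, not part of the post above; `strip_tags()` is added only to show how to get plain text out of the captured markup.

```php
<?php
// Hypothetical helper wrapping the pattern from the post above.
function body_inner(string $text): ?string
{
    // ~ delimits the pattern, s lets . span newlines, i ignores case.
    if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body)) {
        return trim($body[1]);
    }
    return null; // pattern did not match
}

// Made-up sample page, used instead of a network fetch.
$sample = "<html><head><title>Demo</title></head>\n"
        . "<BODY class=\"home\">\n<p>Hello</p>\n</BODY></html>";

$inner = body_inner($sample);
echo $inner, "\n";             // the markup between the tags
echo strip_tags($inner), "\n"; // the text only, with tags removed
```

If this prints the paragraph for the sample but nothing for a real page, the problem is more likely in the fetched content than in the pattern.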

nico_swd, Nov 5, 2009 IP

ramysarwat Peon

#3

nico_swd said: ↑
if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body))
{
    echo $body[1];
}

Thank you, nico_swd. I tried this code, but it never gives any output. Any idea why?

<?php
$text = file_get_contents("http://www.google.com/");
if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body)){
echo $body[1];
}

?>

ramysarwat, Nov 5, 2009 IP

nico_swd Prominent Member

#4

Because Google will redirect you, and file_get_contents() doesn't follow redirects. Try another domain and it'll work.
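For anyone who wants to rule redirects in or out: a minimal sketch, assuming `allow_url_fopen` is enabled. As far as I know, the `http` stream wrapper has a `follow_location` context option that is normally on by default, so setting it explicitly mainly removes the guesswork. The helper name `redirect_context`, the timeout value and the URL are placeholders, not something from this thread.

```php
<?php
// Hypothetical helper: builds an HTTP stream context with explicit redirect settings.
function redirect_context(int $maxRedirects = 5)
{
    return stream_context_create([
        'http' => [
            'follow_location' => 1,             // follow Location: headers
            'max_redirects'   => $maxRedirects, // give up after this many hops
            'timeout'         => 10,            // seconds
        ],
    ]);
}

// Placeholder URL; replace it with the page you are testing.
// $text = file_get_contents('http://www.example.com/', false, redirect_context());
// var_dump($text === false ? 'fetch failed' : strlen($text));
```

If the fetch returns a non-empty string, the download itself is fine and the extraction step is the place to look.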

nico_swd, Nov 5, 2009 IP

ramysarwat Peon

#5

nico_swd said: ↑

Because Google will redirect you, and file_get_contents() doesn't follow redirects. Try another domain and it'll work.

I tried it on 3 other websites that have content, but nothing happens there either. Any other ideas?

ramysarwat, Nov 5, 2009 IP

nico_swd Prominent Member

#6


$ch = curl_init('http://nicoswd.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$text = curl_exec($ch); // curl_exec() needs the handle returned by curl_init()

if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body))
{
    echo $body[1];
}

PHP:

nico_swd, Nov 5, 2009 IP

ramysarwat Peon

#7

nico_swd said: ↑
$ch = curl_init('http://nicoswd.com/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$text = curl_exec($ch); // curl_exec() needs the handle returned by curl_init()

if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body))
{
    echo $body[1];
}

I can't believe it: the same result with cURL too.

When I print the output of cURL or file_get_contents() I get the page, but when I use preg_match() I get nothing.
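When the download works but `preg_match()` yields nothing, it helps to separate "the pattern did not match" (return value `0`) from "PCRE gave up" (return value `false`). A minimal sketch follows; the helper name `extract_body` is hypothetical. One possible cause on large pages, which I am assuming rather than asserting for this case, is the lazy `(.*?)` running into `pcre.backtrack_limit`.

```php
<?php
// Hypothetical helper: returns the body markup or throws with the reason.
function extract_body(string $html): string
{
    $result = preg_match('~<body[^>]*>(.*?)</body>~si', $html, $m);

    if ($result === false) {
        // PCRE itself failed, e.g. PREG_BACKTRACK_LIMIT_ERROR on a big page.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    if ($result === 0) {
        // The regex ran fine but the page has no <body>...</body> pair.
        throw new RuntimeException('No <body>...</body> pair found');
    }
    return $m[1];
}

try {
    echo extract_body('<html><body>Hi</body></html>'), "\n";
    echo extract_body('<html><p>no body tags</p></html>'), "\n";
} catch (RuntimeException $e) {
    echo 'Failed: ', $e->getMessage(), "\n";
}
```

If the error code turns out to be `PREG_BACKTRACK_LIMIT_ERROR`, raising the limit before the call, for example `ini_set('pcre.backtrack_limit', '1000000');`, or switching to `strpos()` or a DOM parser, are the usual ways around it.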

ramysarwat, Nov 5, 2009 IP

nico_swd Prominent Member

#8

Which domains have you tried?

nico_swd, Nov 5, 2009 IP

ramysarwat Peon

#9

nico_swd said: ↑

Which domains have you tried?

try this:
http://forums.digitalpoint.com/

ramysarwat, Nov 5, 2009 IP

mony911 Peon

Messages: 114

Likes Received: 1

Best Answers: 0

Trophy Points: 0

#10

Try this; it should work:

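<?php
// $url is assumed to hold the address of the page to fetch.
$txt = file_get_contents($url);

$arr = get_tag($txt, "body");

print_r($arr);

// Returns the inner content of every <$tag ...>...</$tag> pair found in $txt.
function get_tag($txt, $tag){
    $offset = 0;
    $start_tag = "<".$tag;
    $end_tag = "</".$tag.">";
    $arr = array();
    // stripos() is case-insensitive, so <BODY> is found too.
    // Compare against false, because a match at position 0 is valid.
    while (($pos = stripos($txt, $start_tag, $offset)) !== false) {
        $gt = strpos($txt, ">", $pos);
        if ($gt === false) {
            break; // malformed opening tag
        }
        $str_pos = $gt + 1;
        $end_pos = stripos($txt, $end_tag, $str_pos);
        if ($end_pos === false) {
            break; // no closing tag: stop instead of looping forever
        }
        $arr[] = substr($txt, $str_pos, $end_pos - $str_pos);
        $offset = $end_pos;
    }
    return $arr;
}
?>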

This was written by Bony Yousuf. The original post is here:

http://www.sitepoint.com/forums/showthread.php?t=643722

mony911, Nov 5, 2009 IP

unigogo Peon

Messages: 286

Likes Received: 8

Best Answers: 0

Trophy Points: 0

#11

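Remove carriage returns ($html is assumed to hold the page source):
$str = preg_replace("/\r/", " ", $html);

Retrieve the HTML from the body tag onward (the s modifier lets . match newlines):
preg_match("/<\s*body[^>]*>.*/s", $str, $body);

Split on the tags to get the text pieces ($body[0] holds the matched string):
$result = preg_split("/<(.|\n)*?>/", $body[0]);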

I tested these steps here:
http://www.pagecolumn.com/tool/pregtest.htm

Last edited: Nov 5, 2009

unigogo, Nov 5, 2009 IP

Izonedig Member

Messages: 150

Likes Received: 1

Best Answers: 0

Trophy Points: 28

#12

Use an HTML DOM parser; you can then get the content of any part of the HTML document.
http://simplehtmldom.sourceforge.net/
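If you would rather not add a library, the same idea can be sketched with PHP's bundled DOM extension. Note this uses `DOMDocument`, not Simple HTML DOM, and the helper name `body_text` and the sample markup are assumptions for illustration. It returns text only; for the markup, `$doc->saveHTML($body)` could be used instead.

```php
<?php
// Hypothetical helper built on PHP's bundled DOM extension.
function body_text(string $html): ?string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser builds a proper tree, so tag case and attributes do not matter.
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? null : trim($body->textContent);
}

echo body_text('<html><body><p>Hello <b>world</b></p></body></html>'), "\n";
```

Keep in mind that `textContent` also includes the contents of any `<script>` or `<style>` elements inside the body, so those may need removing first.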

Izonedig, Feb 17, 2010 IP

danx10 Peon

Messages: 1,179

Likes Received: 44

Best Answers: 2

Trophy Points: 0

#13

Make sure the actual site has a body tag.

<?php

$site = file_get_contents("http://en.wikipedia.org/wiki/Benchmark");

// Only use $matches[1] when the pattern actually matched.
if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
    highlight_string($matches[1]);
}

?>

PHP:

Another example....

<?php

$site = file_get_contents("http://www.google.com/codesearch");

// Only use $matches[1] when the pattern actually matched.
if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
    highlight_string($matches[1]);
}

?>

PHP:
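On the point about the body tag: HTML allows the closing `</body>` to be left out, and some pages do omit it, in which case the patterns in this thread cannot match. A minimal sketch of a more tolerant variant follows. The helper name `body_or_rest` is hypothetical, and the only change to the pattern is that it also accepts the end of the input (`\z`) as a terminator.

```php
<?php
// Hypothetical variant: accept end of input when </body> is omitted.
function body_or_rest(string $html): ?string
{
    if (preg_match('~<body[^>]*>(.*?)(?:</body>|\z)~si', $html, $m)) {
        return $m[1];
    }
    return null; // no opening <body> tag at all
}

// Made-up inputs: one complete page, one without the closing tag.
var_dump(body_or_rest('<html><body>complete</body></html>'));
var_dump(body_or_rest('<html><body>never closed'));
var_dump(body_or_rest('<html><p>no body tag</p>'));
```

This still needs an opening `<body>` tag; for pages that leave that out as well, a DOM parser is probably the safer route.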

danx10, Feb 17, 2010 IP
