Regex to extract body content

amorph Peon

Messages:: 200

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 0

#1

Does anyone have a good regex to extract the content of a page body, i.e. anything between <body> and </body>?

Thanks.

amorph, Jun 28, 2007 IP

ansi Well-Known Member

Messages:: 1,483

Likes Received:: 65

Best Answers:: 0

Trophy Points:: 100

#2

/<body>(.*)<\/body>/im

ansi, Jun 28, 2007 IP

amorph Peon

#3

hmmm...

so why is this not working?


$string = '<body>sad asd asd asd asdasd asdasd</body>';
if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
	$result = $regs[1];
} else {
	$result = "";
}

PHP:

amorph, Jun 28, 2007 IP

ansi Well-Known Member

#4

because you're not outputting anything?


<?php
$string = '<body>sad asd asd asd asdasd asdasd</body>';
if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
	$result = $regs[1];
} else {
	$result = "";
}
echo $result;
?>

PHP:

ansi, Jun 28, 2007 IP

amorph Peon

#5

right...I feel stupid

amorph, Jun 28, 2007 IP

amorph Peon

#6

OK, it was working with my example above, but when confronted with a true page source it fails to output anything. Here's a sample:
$string = '<body>
asdasd
asdasd
</body>
';
if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
	$result = $regs[1];
} else {
	$result = "";
}
echo $result;
PHP:
I guess we're missing tabs, new lines, line breaks etc.
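
That guess can be checked on a tiny string. The following is only a minimal sketch (the test string and variable names are made up for illustration): by default the dot does not match a newline, the m modifier does not change that, and the s modifier does.

```php
<?php
// Hypothetical three-character input: "a", a newline, "b".
$text = "a\nb";

// No modifier: "." refuses to match "\n", so nothing is found.
$plain = preg_match('/a.b/', $text);

// "m" (multiline) only changes where ^ and $ may match; "." is unaffected.
$multiline = preg_match('/a.b/m', $text);

// "s" (dotall) is the modifier that lets "." match "\n" as well.
$dotall = preg_match('/a.b/s', $text);

var_dump($plain, $multiline, $dotall); // int(0) int(0) int(1)
```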

amorph, Jun 28, 2007 IP

ansi Well-Known Member

#7

Well, you have two choices: you can strip the newlines and use what you have, or you can use preg_match_all instead.
<?php
	$string = '<body>
	asdasd
	asdasd
	</body>
	';

	if (preg_match_all('/<body>(.*)<\/body>/ism', $string, $regs))
	{
		$result = $regs[1][0];
	} 
	else
	{
		$result = "nadda";
	}   
	echo $result;
?>
PHP:

ansi, Jun 28, 2007 IP

nico_swd Prominent Member

Messages:: 4,153

Likes Received:: 344

Best Answers:: 18

Trophy Points:: 375

#8

No need for preg_match_all(), just a little fix in the pattern does it as well:
'/<body>(.*?)<\/body>/si'
PHP:
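
As a rough sketch of how that pattern could be dropped into the earlier multi-line sample (the sample string is rewritten with explicit "\n" escapes here, and trim() is added only to tidy the output), plain preg_match() seems to be enough once the s modifier is present:

```php
<?php
// Same shape as the multi-line sample from earlier in the thread.
$string = "<body>\nasdasd\nasdasd\n</body>\n";

// The s modifier lets "." cross line breaks; i makes the tag names case-insensitive.
if (preg_match('/<body>(.*?)<\/body>/si', $string, $regs)) {
    $result = $regs[1];   // the raw capture still has the surrounding newlines
} else {
    $result = '';
}

echo trim($result);
```

With preg_match() the captured text is in $regs[1], whereas preg_match_all() nests it one level deeper, in $regs[1][0].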

nico_swd, Jun 29, 2007 IP

ansi Well-Known Member

#9

I tried the s modifier. It didn't work for me.

Edit:
Okay, well, I thought I did anyway, but I backtracked and apparently I didn't. My bad.

ansi, Jun 29, 2007 IP

nico_swd Prominent Member

#10

Note the question mark.
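
To make the effect of the question mark visible, here is a small sketch on a deliberately artificial input with two closing tags (not something a valid page would contain; it is assumed purely for illustration). The lazy .*? stops at the first </body>, while the greedy .* runs on to the last one:

```php
<?php
// Hypothetical input with two closing tags, only to expose the difference.
$string = '<body>first</body><body>second</body>';

preg_match('/<body>(.*?)<\/body>/si', $string, $lazy);   // lazy: shortest match
preg_match('/<body>(.*)<\/body>/si', $string, $greedy);  // greedy: longest match

echo $lazy[1], "\n";    // first
echo $greedy[1], "\n";  // first</body><body>second
```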

nico_swd, Jun 29, 2007 IP

Weizheng Peon

Messages:: 93

Likes Received:: 6

Best Answers:: 0

Trophy Points:: 0

#11

nico_swd said: ↑
No need for preg_match_all(), just a little fix in the pattern does it as well:
'/<body>(.*?)<\/body>/si'
PHP:
I think there's still room for improvement

Try parsing
$string = '<body style="color:#fff">
    asdasd
    asdasd
    </body>
    ';
PHP:

Weizheng, Jun 29, 2007 IP

nico_swd Prominent Member

#12

Okay okay, lol.
'/<body[^>]*>(.*?)<\/body>/si'
PHP:
This should do it.
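
A minimal sketch of that pattern applied to the sample with a style attribute (trim() is an addition here, used only to drop the surrounding whitespace):

```php
<?php
$string = '<body style="color:#fff">
    asdasd
    asdasd
    </body>
    ';

// [^>]* skips over whatever attributes sit inside the opening tag.
if (preg_match('/<body[^>]*>(.*?)<\/body>/si', $string, $regs)) {
    $result = trim($regs[1]);
} else {
    $result = '';
}

echo $result;
```

One caveat, as far as I can tell: [^>]* stops at the first >, so an attribute value that itself contains a > would still trip it up.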

nico_swd, Jun 29, 2007 IP

rodney88 Guest

Messages:: 480

Likes Received:: 37

Best Answers:: 0

Trophy Points:: 0

#13

A valid HTML document will only have one </body> tag, so you could improve it further by using a greedy quantifier. The lazy/ungreedy asterisk starts right after the opening <body> tag and expands one character at a time. Each time it reaches a <, it checks whether the next character is a /, then a b, and so on. It has to do that for every closing HTML tag, and when it finds a character that doesn't match (e.g. in '</b>' or '</div>'), it has to backtrack to before that < and carry on consuming the tag, one character at a time. It has to do this right up to (almost) the end of the document, where it finds the </body>.

I dunno how bad it actually is, but it's crashed RegexBuddy attempting that expression just on the page source for this page (admittedly, I'm on a pretty slow computer atm).

Using a greedy asterisk, the dot matches everything and reads right up to the end of the document straight away. It then backtracks one character at a time to find a <, and checks if it's followed by /, b, (etc.). If that fails, it goes back and continues giving up one character at a time until it finds the </body>.

And since almost every real page has a </body> tag (strictly speaking the closing tag is optional in HTML, but it is rarely left out), normally only a few characters away from the end of the document, we can save a lot of time by being greedy. Effectively, the running time of the lazy asterisk is proportional to the length of the body, whereas the running time of the greedy asterisk is proportional to the distance between the </body> and the end of the document.
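
A rough sketch of the greedy variant, assuming a made-up document with a single </body> near the end (the $html string and its size are illustrative only). It checks that both quantifiers capture the same text in that case. It does not measure speed, which will depend on the PCRE version and settings in use:

```php
<?php
// Hypothetical document: a long body, then </body></html> close to the end.
$html = '<html><body class="x">'
      . str_repeat('<div>text</div>', 1000)
      . '</body></html>';

// Lazy: grows the capture one character at a time from the start of the body.
$lazyOk = preg_match('/<body[^>]*>(.*?)<\/body>/si', $html, $lazy);

// Greedy: ".*" runs to the end of the input, then backs off to the last </body>.
$greedyOk = preg_match('/<body[^>]*>(.*)<\/body>/si', $html, $greedy);

if ($lazyOk === false || $greedyOk === false) {
    // preg_match() returns false on an engine error such as hitting the backtrack limit.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    var_dump($lazy[1] === $greedy[1]); // bool(true)
    echo strlen($greedy[1]), "\n";     // 15000
}
```

If a large page makes either call return false, preg_last_error() can be compared against constants such as PREG_BACKTRACK_LIMIT_ERROR to see what went wrong.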

rodney88, Jun 29, 2007 IP

amorph Peon

#14

Thank you all!

amorph, Jun 29, 2007 IP
