Does anyone have a good regex to extract the content of a page body, i.e. anything between <body> and </body>? Thanks.
Hmmm... so why is this not working?
$string = '<body>sad asd asd asd asdasd asdasd</body>';
if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}
Because you're not outputting anything?
<?php
$string = '<body>sad asd asd asd asdasd asdasd</body>';
if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}
echo $result;
?>
OK, it was working with my example above, but when confronted with a real page source it fails to output anything. Here's a sample:
$string = '<body>
asdasd
asdasd
</body>
';
if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}
echo $result;
I guess we're missing tabs, new lines, line breaks etc.
Well, you have 2 choices: you can strip the newlines and use what you have, or you can use preg_match_all instead.
<?php
$string = '<body>
asdasd
asdasd
</body>
';
if (preg_match_all('/<body>(.*)<\/body>/ism', $string, $regs)) {
    $result = $regs[1][0];
} else {
    $result = "nadda";
}
echo $result;
?>
No need for preg_match_all(); just a small fix in the pattern does it as well: '/<body>(.*?)<\/body>/si'
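To make the fix above concrete, here is a minimal sketch of why the s modifier matters. The sample string is made up for illustration. By default the dot does not match a newline, so the pattern can never get from <body> to </body> when the body spans several lines. The m modifier does not help here, because it only changes how ^ and $ behave.

```php
<?php
// Hypothetical sample input: a body that spans several lines.
$string = "<body>\n\tasdasd\n\tasdasd\n</body>\n";

// Without /s the dot stops at "\n", so the pattern cannot reach </body>.
$withoutS = preg_match('/<body>(.*?)<\/body>/i', $string, $m1);

// With /s (PCRE_DOTALL) the dot also matches newlines.
$withS = preg_match('/<body>(.*?)<\/body>/si', $string, $m2);

var_dump($withoutS);   // int(0)
var_dump($withS);      // int(1)
echo trim($m2[1]), "\n";
```

So the earlier preg_match_all() version most likely worked because it also added s to the modifiers, not because of the function change.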
I tried the s modifier. Didn't work for me. Edit: okay, well, I thought I did anyway, but I backtracked and apparently I didn't. My bad.
I think there's still room for improvement. Try parsing:
$string = '<body style="color:#fff">
asdasd
asdasd
</body>
';
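One possible way to handle that case, as a sketch rather than a definitive answer: let the opening tag accept optional attributes. The pattern below is an assumption on my part, not something posted earlier in the thread. It will still trip over an attribute value that contains a literal ">", so treat it as good enough for typical pages only.

```php
<?php
// Hypothetical input: the opening tag now carries an attribute.
$string = "<body style=\"color:#fff\">\n\tasdasd\n\tasdasd\n</body>\n";

// The earlier pattern requires a bare "<body>" and no longer matches.
$old = preg_match('/<body>(.*?)<\/body>/si', $string);

// \b ends the tag name at "body"; [^>]* skips any attributes
// up to the ">" that closes the opening tag.
$pattern = '/<body\b[^>]*>(.*?)<\/body>/si';
$result = preg_match($pattern, $string, $regs) ? $regs[1] : '';

var_dump($old);          // int(0)
echo trim($result), "\n";
```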
A valid HTML document will only have one </body> tag, so you could improve it further by using a greedy quantifier.
The lazy/ungreedy asterisk will start right after the opening <body> tag and expand one character at a time until it reaches a <, then it checks if the next character is a /, then a b, etc. It will have to do that for every closing HTML tag, and when it finds a character that doesn't match (i.e. for '</b>' or '</div>') it'll have to backtrack to before that < and start consuming the whole tag again, one character at a time. It has to do this right up to (almost) the end of the document, where it finds the </body>. I don't know how bad it actually is, but it crashed RegexBuddy when I tried that expression on just the page source for this page (admittedly, I'm on a pretty slow computer at the moment).
Using a greedy asterisk, the dot matches everything and reads right up to the end of the document straight away. It then backtracks one character at a time to find a <, and checks if it's followed by /, b, etc. If that fails, it goes back and continues giving up one character at a time until it finds the </body>. And since we know all valid HTML documents will have a </body> tag, normally only a few characters away from the end of the document, we can save a lot of time by being greedy.
Effectively, the running time of the lazy asterisk is proportional to the length of the body, whereas the running time of the greedy asterisk is proportional to the distance between the </body> and the end of the document.
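A small sketch of the point above, with made-up input and variable names. It does not measure speed. It only shows that the lazy and greedy versions capture the same text when there is exactly one </body>, and how they differ when that assumption is broken. It uses an opening-tag part that also accepts attributes.

```php
<?php
// Hypothetical document with many closing tags before the single </body>.
$html = "<html><body>" . str_repeat("<div><b>x</b></div>\n", 200) . "</body></html>";

$lazy   = '/<body\b[^>]*>(.*?)<\/body>/si';
$greedy = '/<body\b[^>]*>(.*)<\/body>/si';

preg_match($lazy, $html, $a);
preg_match($greedy, $html, $b);

// With exactly one </body>, both patterns capture the same text.
var_dump($a[1] === $b[1]);   // bool(true)

// With a stray "</body>" in the input the results differ:
// lazy stops at the first one, greedy runs to the last one.
$odd = "<body>one</body>two</body>";
preg_match($lazy, $odd, $c);
preg_match($greedy, $odd, $d);
echo $c[1], "\n";   // one
echo $d[1], "\n";   // one</body>two
```

The greedy version is only safe if the input really contains a single </body>. If it might not, the lazy version gives the more conservative result.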