Regex to extract body content

Discussion in 'PHP' started by amorph, Jun 28, 2007.

  1. #1
    Does anyone have a good regex to extract the content of a page body, i.e. anything between <body> and </body>?

    Thanks.
     
    amorph, Jun 28, 2007 IP
  2. ansi

    ansi Well-Known Member

    Messages:
    1,483
    Likes Received:
    65
    Best Answers:
    0
    Trophy Points:
    100
    #2
    /<body>(.*)<\/body>/im
     
    ansi, Jun 28, 2007 IP
  3. amorph

    amorph Peon

    Messages:
    200
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #3
    hmmm...

    so why is this not working?
    
    $string = '<body>sad asd asd asd asdasd asdasd</body>';
    if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
        $result = $regs[1];
    } else {
        $result = "";
    }
    PHP:
     
    amorph, Jun 28, 2007 IP
  4. ansi

    #4
    because you're not outputting anything?

    
    <?php
    $string = '<body>sad asd asd asd asdasd asdasd</body>';
    if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
    	$result = $regs[1];
    } else {
    	$result = "";
    }
    echo $result;
    ?>
    
    PHP:
     
    ansi, Jun 28, 2007 IP
  5. amorph

    #5
    right...I feel stupid:)
     
    amorph, Jun 28, 2007 IP
  6. amorph

    #6
    OK, it was working with my example above, but when confronted with a real page source it fails to output anything. Here's a sample:

    $string = '<body>
    asdasd
    asdasd
    </body>
    ';
    if (preg_match('/<body>(.*)<\/body>/im', $string, $regs)) {
        $result = $regs[1];
    } else {
        $result = "";
    }
    echo $result;
    PHP:
    I guess we're missing tabs, new lines, line breaks etc.
     
    amorph, Jun 28, 2007 IP
  7. ansi

    #7
    Well, you have two choices: you can strip the newlines and use what you have, or you can use preg_match_all instead.

    
    <?php
    	$string = '<body>
    	asdasd
    	asdasd
    	</body>
    	';
    
    	if (preg_match_all('/<body>(.*)<\/body>/ism', $string, $regs))
    	{
    		$result = $regs[1][0];
    	} 
    	else
    	{
    		$result = "nadda";
    	}   
    	echo $result;
    ?>
    
    PHP:
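
    For the first choice mentioned in this post (stripping the newlines), a minimal sketch might look like the following. It assumes the line breaks inside the body are not needed afterwards, because removing them fuses adjacent lines; the variable name $flattened is only illustrative.

```php
<?php
// Sample input from the thread: the body spans several lines.
$string = "<body>\nasdasd\nasdasd\n</body>\n";

// Remove carriage returns and line feeds so that "." can reach the
// closing tag even without the "s" modifier.
$flattened = str_replace(array("\r", "\n"), '', $string);

if (preg_match('/<body>(.*)<\/body>/i', $flattened, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}

echo $result; // prints: asdasdasdasd
```

    Note that the two "asdasd" lines end up joined together, which is the price of this approach.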
     
    ansi, Jun 28, 2007 IP
  8. nico_swd

    nico_swd Prominent Member

    Messages:
    4,153
    Likes Received:
    344
    Best Answers:
    18
    Trophy Points:
    375
    #8
    No need for preg_match_all(), just a little fix in the pattern does it as well:
    
    '/<body>(.*?)<\/body>/si'
    
    PHP:
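
    To show what the modifiers are doing here, a small sketch using the multi-line sample from earlier in the thread: "m" only changes how ^ and $ behave, while "s" is what lets the dot match newlines. The variable names are just for illustration.

```php
<?php
// Sample input from the thread: the body spans several lines.
$string = "<body>\nasdasd\nasdasd\n</body>\n";

// "m" (multiline) only affects the ^ and $ anchors. The dot still stops
// at the first newline, so this pattern cannot reach </body>.
$withoutS = preg_match('/<body>(.*)<\/body>/im', $string, $m1);

// "s" (dotall) lets the dot match newlines too, so plain preg_match()
// is enough.
$withS = preg_match('/<body>(.*?)<\/body>/si', $string, $m2);

$result = $withS ? $m2[1] : "";
echo $result;
```

    Here $withoutS is 0 (no match) and $withS is 1, with $m2[1] holding everything between the two tags, newlines included.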
     
    nico_swd, Jun 29, 2007 IP
  9. ansi

    #9
    I tried the s modifier. Didn't work for me :(

    edit:
    Okay, well, I thought I did anyway, but I backtracked and apparently I didn't :) my bad yo.
     
    ansi, Jun 29, 2007 IP
  10. nico_swd

    #10
    Note the question mark. :)
     
    nico_swd, Jun 29, 2007 IP
  11. Weizheng

    Weizheng Peon

    Messages:
    93
    Likes Received:
    6
    Best Answers:
    0
    Trophy Points:
    0
    #11
    I think there's still room for improvement

    Try parsing this :)

    $string = '<body style="color:#fff">
        asdasd
        asdasd
        </body>
        ';
    PHP:
     
    Weizheng, Jun 29, 2007 IP
  12. nico_swd

    #12
    Okay okay, lol.

    
    '/<body[^>]*>(.*?)<\/body>/si'
    
    PHP:
    This should do it.
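
    A quick sketch of the difference, using the sample string with a style attribute from the previous post. The trim() call is an extra, added only to drop the surrounding whitespace; it is not part of the pattern itself.

```php
<?php
// Sample input from the thread: the opening tag carries an attribute.
$string = "<body style=\"color:#fff\">\n    asdasd\n    asdasd\n    </body>\n";

// A literal "<body>" no longer matches once the tag has attributes.
$plain = preg_match('/<body>(.*?)<\/body>/si', $string, $m1);

// "[^>]*" skips anything up to the ">" that closes the opening tag.
$withAttrs = preg_match('/<body[^>]*>(.*?)<\/body>/si', $string, $m2);

// trim() is optional: it only removes leading and trailing whitespace.
$body = $withAttrs ? trim($m2[1]) : "";
echo $body;
```

    One caveat: "[^>]*" assumes no attribute value contains a literal ">", which is usually but not always true of real pages.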
     
    nico_swd, Jun 29, 2007 IP
  13. rodney88

    rodney88 Guest

    Messages:
    480
    Likes Received:
    37
    Best Answers:
    0
    Trophy Points:
    0
    #13
    A valid HTML document has only one </body> tag, so you could improve it further by using a greedy quantifier. The lazy (ungreedy) asterisk starts right after the opening <body> tag and expands one character at a time. Each time it reaches a <, it checks whether the next character is a /, then a b, and so on. It has to do that for every closing HTML tag, and whenever it hits a character that doesn't match (e.g. in '</b>' or '</div>'), it has to backtrack to before that < and carry on consuming the tag one character at a time. It keeps doing this almost to the end of the document, where it finally finds the </body>.

    I don't know how bad it actually is, but RegexBuddy crashed when I tried that expression on just the page source for this page (admittedly, I'm on a pretty slow computer at the moment).

    With a greedy asterisk (and the s modifier), the dot matches everything and reads right up to the end of the document straight away. It then backtracks one character at a time until it finds a <, and checks whether it's followed by /, b, and so on. If that fails, it goes back to giving up one character at a time until it finds the </body>.

    And since we know every valid HTML document has a </body> tag, normally only a few characters from the end of the document, we can save a lot of time by being greedy. Effectively, the running time of the lazy asterisk is roughly proportional to the length of the body, whereas the running time of the greedy asterisk is roughly proportional to the distance between the </body> and the end of the document.
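
    For anyone who wants to check this on their own machine, here is a rough timing sketch. The test document, the time_pattern() helper and the run count are all made up for illustration, and the actual numbers will depend on the PHP and PCRE versions and on whether the PCRE JIT is enabled, so treat the timings as indicative only.

```php
<?php
// Hypothetical test document: a body full of short tags, followed by
// only a few characters after </body>.
$inner = str_repeat("<p>text <b>bold</b></p>\n", 500);
$html  = '<html><body class="x">' . $inner . '</body></html>';

$lazy   = '/<body[^>]*>(.*?)<\/body>/si';
$greedy = '/<body[^>]*>(.*)<\/body>/si';

// Illustrative helper: runs the pattern $runs times and returns the
// elapsed seconds together with the captured body (null if no match).
function time_pattern($pattern, $subject, $runs = 50)
{
    $m = array();
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_match($pattern, $subject, $m);
    }
    $elapsed = microtime(true) - $start;
    return array($elapsed, isset($m[1]) ? $m[1] : null);
}

list($lazyTime, $lazyBody)     = time_pattern($lazy, $html);
list($greedyTime, $greedyBody) = time_pattern($greedy, $html);

printf("lazy:   %.5f s\n", $lazyTime);
printf("greedy: %.5f s\n", $greedyTime);
```

    Both patterns capture the same text here because the document contains exactly one </body>. If the string "</body>" could appear more than once (inside a script or a comment, say), the greedy version would run on to the last occurrence, so the two are only interchangeable under the single-closing-tag assumption made above.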
     
    rodney88, Jun 29, 2007 IP
    nevetS likes this.
  14. amorph

    #14
    Thank you all!
     
    amorph, Jun 29, 2007 IP