How can I extract all the text in an HTML page between the <body> and </body> tags?

Discussion in 'PHP' started by ramysarwat, Nov 5, 2009.

  1. #1
    How can I extract all the text in an HTML page between the <body> and </body> tags?
     
    ramysarwat, Nov 5, 2009 IP
  2. nico_swd

    nico_swd Prominent Member

    Messages:
    4,153
    Likes Received:
    344
    Best Answers:
    18
    Trophy Points:
    375
    #2
    
    if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body))
    {
        echo $body[1];
    }
    
    PHP:
     
    nico_swd, Nov 5, 2009 IP
  3. ramysarwat

    ramysarwat Peon

    Messages:
    164
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #3
    Thank you, nico_swd. I tried this code, but it never gives any output. Any idea why?

    <?php
    $text = file_get_contents("http://www.google.com/");
    if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body)) {
        echo $body[1];
    }

    ?>
     
    ramysarwat, Nov 5, 2009 IP
  4. nico_swd

    #4
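    Probably because Google redirects you, and file_get_contents() may not follow the redirect (that depends on the follow_location setting of the HTTP stream context, which is on by default). Try another domain and it should work.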
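
If redirects are the suspected cause, one way to rule them out is to make the redirect behaviour explicit with an HTTP stream context. The following is only a sketch: the helper name redirectFollowingContext() is hypothetical, the timeout and user agent values are arbitrary, and it assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper (not from this thread): builds an HTTP stream context
// that explicitly follows redirects, for use with file_get_contents().
function redirectFollowingContext(int $maxRedirects = 5)
{
    return stream_context_create([
        'http' => [
            'follow_location' => 1,             // follow Location: headers
            'max_redirects'   => $maxRedirects, // give up after this many hops
            'timeout'         => 10,            // seconds; arbitrary choice
            'user_agent'      => 'Mozilla/5.0 (compatible; example-fetcher)',
        ],
    ]);
}

// Usage (the URL is a placeholder):
// $text = file_get_contents('http://www.example.com/', false, redirectFollowingContext());
```

If the page still comes back empty with this context, the redirect is probably not the problem.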
     
    nico_swd, Nov 5, 2009 IP
  5. ramysarwat

    #5
    I tried it on three other websites that have content, but nothing happened there either. Any other ideas?
     
    ramysarwat, Nov 5, 2009 IP
  6. nico_swd

    #6
    
    $ch = curl_init('http://nicoswd.com/');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    
    $text = curl_exec($ch);
    
    if (preg_match('~<body[^>]*>(.*?)</body>~si', $text, $body))
    {
        echo $body[1];
    }
    
    PHP:
     
    nico_swd, Nov 5, 2009 IP
  7. ramysarwat

    #7
    I can't believe it: the same result with cURL too.

    When I print the output of cURL or file_get_contents() I get the page, but when I use preg_match() I get nothing.
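
When the download works but preg_match() yields nothing, it can help to tell "no match" apart from "PCRE gave up". preg_match() returns 0 when the pattern simply does not match and false when the engine fails, for example when a lazy (.*?) over a large page hits pcre.backtrack_limit. A minimal sketch, assuming PHP 7 or later; describeBodyMatch() is a hypothetical helper, not something posted in this thread:

```php
<?php
// Hypothetical helper: reports why the <body> match did or did not succeed.
function describeBodyMatch(string $html): string
{
    $result = preg_match('~<body[^>]*>(.*?)</body>~si', $html, $body);

    if ($result === false) {
        // PCRE itself failed (e.g. backtrack limit); the code is a PREG_* constant.
        return 'error ' . preg_last_error();
    }
    if ($result === 0) {
        // The pattern ran fine, but there is no <body>...</body> pair.
        return 'no match';
    }
    return 'matched ' . strlen($body[1]) . ' bytes';
}

echo describeBodyMatch('<html><body>Hello</body></html>'), "\n";     // matched 5 bytes
echo describeBodyMatch('<html><p>No body tags</p></html>'), "\n";    // no match
```

An "error" result would point at the regex engine rather than at the download or the page.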
     
    ramysarwat, Nov 5, 2009 IP
  8. nico_swd

    #8
    Which domains have you tried?
     
    nico_swd, Nov 5, 2009 IP
  9. ramysarwat

    #9
    ramysarwat, Nov 5, 2009 IP
  10. mony911

    mony911 Peon

    Messages:
    114
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Try this; it should work.


    This was written by Bony Yousuf. The original post is here:

    http://www.sitepoint.com/forums/showthread.php?t=643722
     
    mony911, Nov 5, 2009 IP
  11. unigogo

    unigogo Peon

    Messages:
    286
    Likes Received:
    8
    Best Answers:
    0
    Trophy Points:
    0
    #11
    Remove carriage returns:
    $str = preg_replace("/\r/", " ", $html);

    Retrieve the HTML between the body tags:
    preg_match("/<\s*body[^>]*>(.*?)<\s*\/\s*body\s*>/is", $str, $body);

    Split the captured HTML on its tags to get the text pieces:
    $result = preg_split("/<[^>]*>/", $body[1]);

    I tested these steps here:
    http://www.pagecolumn.com/tool/pregtest.htm
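
The steps above can be sketched as a single function that returns plain text rather than an array of pieces. This is only an illustration: bodyText() is a hypothetical name, it swaps preg_split() for PHP's built-in strip_tags(), and it drops script and style blocks first on the assumption that their contents are not wanted as text.

```php
<?php
// Hypothetical helper: returns the visible text between <body> and </body>,
// or null when the page has no matching body tags.
function bodyText(string $html): ?string
{
    if (!preg_match('~<body[^>]*>(.*?)</body>~si', $html, $m)) {
        return null;
    }

    // Remove <script> and <style> blocks so their contents do not leak into the text.
    $inner = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~si', '', $m[1]) ?? $m[1];

    // Strip the remaining tags, then collapse runs of whitespace (including \r\n).
    $text = strip_tags($inner);
    return trim(preg_replace('~\s+~', ' ', $text) ?? $text);
}

echo bodyText("<html><body class=\"home\">\r\n<h1>Title</h1>\r\n<p>Some <b>bold</b> text.</p></body></html>");
// Title Some bold text.
```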
     
    Last edited: Nov 5, 2009
    unigogo, Nov 5, 2009 IP
  12. Izonedig

    Izonedig Member

    Messages:
    150
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    28
    #12
    Izonedig, Feb 17, 2010 IP
  13. danx10

    danx10 Peon

    Messages:
    1,179
    Likes Received:
    44
    Best Answers:
    2
    Trophy Points:
    0
    #13
    Make sure the actual site has a body tag.

    <?php
    
    $site = file_get_contents("http://en.wikipedia.org/wiki/Benchmark");
    
    if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
        highlight_string($matches[1]);
    }
    
    ?>
    PHP:
    Another example....

    <?php
    
    $site = file_get_contents("http://www.google.com/codesearch");
    
    if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
        highlight_string($matches[1]);
    }
    
    ?>
    PHP:
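
As an alternative to the regular expression used in these examples, the body text can also be read through PHP's DOM extension, which tends to cope better with untidy markup. This is a sketch under assumptions: it needs the dom/libxml extension to be loaded, and bodyTextViaDom() is a hypothetical name rather than anything posted in this thread.

```php
<?php
// Sketch only: uses PHP's DOM extension instead of the regex from the thread.
function bodyTextViaDom(string $html): ?string
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // textContent concatenates the text of all descendant nodes.
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? null : trim($body->textContent);
}

echo bodyTextViaDom('<html><body><p>Hello <b>world</b></p></body></html>');
// Hello world
```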
     
    danx10, Feb 17, 2010 IP