parser for HTML

Discussion in 'PHP' started by stats, Jul 6, 2007.

  1. #1
    Hi guys

    I am trying to write a small HTML parser function like this:

    function getHtmlContent(string $url, string $head, string $tail)

    The function should fetch the page at $url and grab whatever content lies between the FIRST occurrence of $head and the FIRST occurrence of $tail.

    For example, if I have HTML like this:

    begin
    111 end
    begin 222
    end begin 333 end ...

    It should either grab only "begin \n111 end" on the first pass, or grab all of the blocks at once and put each one in a separate array element.

    So in the end I will have either "begin \n111 end" or an array like result[0]="begin \n111 end", result[1]="begin 222\n end", result[2]="begin 333 end".

    The array case is preferable.


    Can anyone please help me with this?

    Right now I have come up with the following code:
    $url = "http://us2.php.net/preg_match_all";
    $html = file_get_contents($url);
    $head = "<option";
    $tail = "<\/option>";
    
    function getHtmlContent($page, $head, $tail) {
           $regex="/$head(.*\n*)*$tail/";
           preg_match_all($regex, $page, $m);
           return $m[0];
    }
    
    foreach ( getHtmlContent($html, $head, $tail) as $match) {
                    echo $match;
    }
    It works for SOME sites and SOME values of $head and $tail, but with the values above, for example, it doesn't work.
     
    stats, Jul 6, 2007
  2. SuperMarketer

    #2
    I have no idea what that is.
    You are a smart one.
     
    SuperMarketer, Jul 6, 2007
  3. stats

    #3
    Thanks for your valuable idea .. :)

    Can anyone else please help me?

    I guess I wrote the regexp incorrectly in my function. What I want is a regexp that will match ANYTHING that may appear in a webpage's source, including all the special symbols, newlines and everything else.

    So I wrote (.*\n*)*, but I guess that's not enough.
     
    stats, Jul 6, 2007
  4. Barti1987

    #4
    
    $regex="/$head(.*?)$tail/s";
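
    The "s" modifier in the pattern above makes "." match newlines as well, which is what the (.*\n*)* construct was trying to do. Below is a minimal sketch of how it might be wrapped into the function from post #1, assuming each $head ... $tail block should end up in its own array element. The function name getHtmlBlocks is hypothetical, and the lazy ".*?" and the preg_quote() calls are assumptions added here so that a match stops at the nearest $tail and so that $tail can be passed unescaped.

```php
<?php
// Hypothetical helper (not from the thread): returns every block that starts
// at $head and runs to the nearest following $tail.
function getHtmlBlocks(string $page, string $head, string $tail): array
{
    // preg_quote() escapes regex metacharacters and the "/" delimiter,
    // so $tail can be given as plain "</option>" instead of "<\/option>".
    $regex = '/' . preg_quote($head, '/') . '(.*?)' . preg_quote($tail, '/') . '/s';

    // "s" lets "." match newlines; the "?" makes ".*" lazy, so each match
    // ends at the first $tail after its $head rather than the last one.
    preg_match_all($regex, $page, $m);

    return $m[0]; // full matches, one array element per block
}

// Sample input taken from the example in post #1.
$page = "begin\n111 end\nbegin 222\nend begin 333 end";
print_r(getHtmlBlocks($page, 'begin', 'end'));
```

    With the sample text from post #1 this should give three elements, one per begin ... end block. For a real page, $page would come from file_get_contents($url) as in the original code.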
    

    Peace,
     
    Barti1987, Jul 6, 2007