Parsing link content with preg_replace

Discussion in 'PHP' started by Argento, May 26, 2009.

  1. #1
    Hi, I am trying to parse the director name (the red part):

    <div id="director-info" class="info">
    <h5>Director:</h5>
    <a href="/name/nm0004716/">[COLOR="Red"]Darren Aronofsky[/COLOR]</a><br/>
    </div>
    Code (markup):
    I tried this, but I can't get it to work. How should I do it?

    preg_match('/director:<\/h5><a href=\"([^\"]*)\">(.*)<\/a>/i', $file, $matches)
    Code (markup):
    Thanks a lot!
     
    Argento, May 26, 2009 IP
  2. koko5

    #2
    Hi,

    You have to strip \r\n from the input string ($file):
    preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', preg_replace('#(\r?\n)+#','',$file),$matches);
    PHP:
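    A minimal sketch of why stripping the line breaks matters, assuming the markup is exactly the snippet from the first post (with the forum colour tags left out). The pattern expects `<a` to follow `</h5>` immediately, so the newline between them blocks the match until it is removed. The variable names are illustrative only.

```php
<?php
// Sample markup copied from the first post (forum colour tags omitted).
$file = "<div id=\"director-info\" class=\"info\">\n"
      . "<h5>Director:</h5>\n"
      . "<a href=\"/name/nm0004716/\">Darren Aronofsky</a><br/>\n"
      . "</div>";

// Pattern from the first post: expects <a> immediately after </h5>.
$pattern = '/director:<\/h5><a href="([^"]*)">(.*)<\/a>/i';

// On the raw string the newline after </h5> prevents a match.
$rawResult = preg_match($pattern, $file, $rawMatches);

// After removing the line breaks the same pattern matches.
$flat = preg_replace('#(\r?\n)+#', '', $file);
$flatResult = preg_match($pattern, $flat, $flatMatches);

var_dump($rawResult);        // int(0)
var_dump($flatResult);       // int(1)
echo $flatMatches[2], "\n";  // Darren Aronofsky
```

    Group 1 holds the href and group 2 holds the link text, so the director name is in `$flatMatches[2]`.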
    Regards
     
    koko5, May 27, 2009 IP
  3. Argento

    #3
    Thanks koko. I have a problem: I'm really bad with regular expressions. I took some rules from the internet, but I don't really understand them.

    The array returns two values: the first one is "/name/nm0004716/" (the href content), not Darren Aronofsky (the value I need), and the second value contains all of the web page content.

    How can I solve it? And is there a good tutorial for learning how to parse content with regular expressions? Thanks, and sorry for my English!
     
    Argento, May 27, 2009 IP
  4. koko5

    #4
    Hi, Argento

    The returned result is an array, and its size depends on the round brackets (capture groups) you use in your regular expression.
    Here is an example:
    
    $file='<div id="director-info" class="info">
    <h5>Director:</h5>
    <a href="/name/nm0004716/">Darren Aronofsky</a><br/>
    </div>';
    $matches=array();
    preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', preg_replace('#(\r?\n)+#','',$file),$matches);
    print_r($matches);
    
    PHP:
    Now, let's drop the round brackets from ...href=\"([^\"]*)..., because we don't need the href value, only the innerHTML. The result is then:
    
    Array
    (
        [0] => Director:</h5><a href="/name/nm0004716/">Darren Aronofsky</a>
        [1] => Darren Aronofsky
    )
    
    PHP:
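    For reference, a sketch of what the modified call might look like. The href is still matched but no longer captured, so the link text becomes group 1. Using the lazy `(.*?)` instead of `(.*)?` is my assumption here, not something stated in the post; it makes the group stop at the first `</a>`.

```php
<?php
$file = '<div id="director-info" class="info">
<h5>Director:</h5>
<a href="/name/nm0004716/">Darren Aronofsky</a><br/>
</div>';

// No brackets around [^"]*, so the href is matched but not captured.
// (.*?) is lazy and stops at the first closing </a>.
$pattern = '/director:<\/h5><a href="[^"]*">(.*?)<\/a>/i';

$matches = array();
preg_match($pattern, preg_replace('#(\r?\n)+#', '', $file), $matches);

// $matches[0] is the whole match, $matches[1] is the only capture group.
print_r($matches);
```

    With one pair of round brackets the array has two entries: the full match plus one captured group.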
    Hope it's a little clearer now.
    Regards
     
    koko5, May 27, 2009 IP
  5. Argento

    #5
    Yeah, it works fine in the example. I see that my problem is with the full code, when I load the URL content into a string:

    $url = "http://www.imdb.com/title/tt1125849/";
    
    function get_imdb($url)
    {
       if (!($file = file_get_contents($url)))
          trigger_error('Imposible to return imdb page', E_USER_ERROR);
       if (!preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', preg_replace('#(\r?\n)+#','',$file),$matches))
          trigger_error('Unable to parse IMDB response', E_USER_ERROR);
       return $matches[1];
    }
    
    $resultado = get_imdb($url);
    echo $resultado;
    Code (markup):
    Why doesn't it work in this case?

    Thanks koko!
     
    Argento, May 27, 2009 IP
  6. JDevereux

    #6
    You could also do this using DOM and XPath:

    $html = file_get_contents('http://www.imdb.com/title/tt1125849/');
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//div[@id='director-info']//a");
    
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        echo $href->firstChild->data . '<br />';
        echo $href->getAttribute('href');
    }
    PHP:
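    A self-contained variant of the same idea that can be tried without fetching the live page. The inline `$html` string is a stand-in built from the snippet in the first post, and the `$directors` array is just an illustrative way to collect the results.

```php
<?php
// Stand-in for the downloaded page, based on the snippet in the first post.
$html = '<html><body><div id="director-info" class="info">'
      . '<h5>Director:</h5>'
      . '<a href="/name/nm0004716/">Darren Aronofsky</a><br/>'
      . '</div></body></html>';

$dom = new DOMDocument();
// Suppress the warnings that real-world, non-valid HTML tends to trigger.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Every link inside the div with id "director-info".
$links = $xpath->query("//div[@id='director-info']//a");

$directors = array();
foreach ($links as $link) {
    $directors[] = array(
        'name' => trim($link->textContent),      // link text
        'href' => $link->getAttribute('href'),   // link target
    );
}
print_r($directors);
```

    Because the lookup goes through the parsed tree, line breaks or extra whitespace between the tags do not affect it.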
     
    JDevereux, May 27, 2009 IP
  7. koko5

    #7
    Because the incoming data comes escaped and you have to apply stripslashes:
    preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', stripslashes(preg_replace('#(\r?\n)+#','',$file)),$matches)
    PHP:
    Btw, as JDevereux wrote, it's better to use DOM in this case.
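    If the regex route is still preferred, two other things may be worth checking. This is only a sketch under assumptions the thread does not confirm: that the live page has indentation between the tags, and more than one link after the director. The sample string and the second name are hypothetical.

```php
<?php
// Hypothetical sample: indentation between tags and a second link later on.
$file = "<h5>Director:</h5>\n"
      . "    <a href=\"/name/nm0004716/\">Darren Aronofsky</a><br/>\n"
      . "<h5>Writer:</h5>\n"
      . "    <a href=\"/name/nm0000000/\">Someone Else</a>";

$flat = preg_replace('#(\r?\n)+#', '', $file);

// Thread pattern: needs <a> directly after </h5>, so leftover spaces break it.
$strict = '/director:<\/h5><a href="([^"]*)">(.*)<\/a>/i';
$strictResult = preg_match($strict, $flat, $strictMatches);

// Greedy (.*) runs on to the LAST </a>, swallowing the markup in between.
$greedy = '/director:<\/h5>\s*<a href="([^"]*)">(.*)<\/a>/i';
preg_match($greedy, $flat, $greedyMatches);

// Tolerant variant: \s* skips whitespace, (.*?) stops at the first </a>.
$tolerant = '/director:<\/h5>\s*<a href="([^"]*)">(.*?)<\/a>/i';
$tolerantResult = preg_match($tolerant, $file, $tolerantMatches);

var_dump($strictResult);          // int(0)
echo $tolerantMatches[2], "\n";   // Darren Aronofsky
```

    The greedy case is one way the second group can end up holding a large part of the page, as described in post #3.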

    Regards
     
    koko5, May 27, 2009 IP