Hi, I am trying to parse the director name (Darren Aronofsky) out of this markup:
<div id="director-info" class="info">
<h5>Director:</h5>
<a href="/name/nm0004716/">Darren Aronofsky</a><br/>
</div>
I tried this, but I can't get it to work. How should I do it?
preg_match('/director:<\/h5><a href=\"([^\"]*)\">(.*)<\/a>/i', $file, $matches)
Thanks a lot!
Hi,
You have to strip the line breaks (\r\n) from the input string ($file) before matching:
preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', preg_replace('#(\r?\n)+#','',$file),$matches);
Regards
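To see why the line breaks matter, here is a minimal sketch, assuming the markup from the first post with Windows-style line endings. The names `$pattern` and `$flat` are made up for illustration, and the pattern is a simplified variant of the one above (no redundant escapes, lazy `(.*?)`). It expects `</h5>` to be followed directly by `<a`, so it only matches once the line breaks are gone.

```php
<?php
// Simplified variant of the pattern from the post above (assumption, not the original).
$pattern = '/director:<\/h5><a href="([^"]*)">(.*?)<\/a>/i';

// Sample input with a \r\n between the two tags.
$file = "<h5>Director:</h5>\r\n<a href=\"/name/nm0004716/\">Darren Aronofsky</a><br/>";

// Remove every run of line breaks, as suggested in the post.
$flat = preg_replace('#(\r?\n)+#', '', $file);

var_dump(preg_match($pattern, $file));            // int(0): the line break is in the way
var_dump(preg_match($pattern, $flat, $matches));  // int(1)
echo $matches[2], "\n";                           // Darren Aronofsky
```

Note that this only removes line breaks. If the real page indents its tags, the spaces or tabs between them are still there after the replacement.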
Thanks koko. I have a problem: I am really bad with regular expressions, because I took some rules from the internet without really understanding them. The array is returning two values: the first one is "/name/nm0004716/" (the href content), not Darren Aronofsky (the value I need), and the second one returns all of the page content. How can I solve it? And is there a good tutorial for learning to parse content with regular expressions? Thanks, and sorry for my English!
Hi Argento,
The returned result is an array, and its size depends on the round brackets (capture groups) you use in your regular expression. Here is an example:
$file='<div id="director-info" class="info">
<h5>Director:</h5>
<a href="/name/nm0004716/">Darren Aronofsky</a><br/>
</div>';
$matches=array();
preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', preg_replace('#(\r?\n)+#','',$file),$matches);
print_r($matches);
Now, let's drop the round brackets in ...href=\"([^\"]*)... because we don't need the href value, only the innerHTML. The result is then:
Array
(
    [0] => Director:</h5><a href="/name/nm0004716/">Darren Aronofsky</a>
    [1] => Darren Aronofsky
)
Hope it's a little clearer now.
Regards
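The relationship between round brackets and array entries can be sketched as follows. This is an illustration only: the variable names (`$two`, `$one`, `$named`) and the group name `director` are made up here, the input is the sample markup with line breaks already removed, and the patterns use a lazy `(.*?)` so the capture stops at the first `</a>`.

```php
<?php
// Sample markup from the thread, with the line breaks already stripped.
$html = '<h5>Director:</h5><a href="/name/nm0004716/">Darren Aronofsky</a><br/>';

// Two capture groups: $two[1] is the href, $two[2] is the link text.
preg_match('/director:<\/h5><a href="([^"]*)">(.*?)<\/a>/i', $html, $two);

// One capture group: only the link text is captured, in $one[1].
preg_match('/director:<\/h5><a href="[^"]*">(.*?)<\/a>/i', $html, $one);

// Named group: the text is available under a readable key as well.
preg_match('/director:<\/h5><a href="[^"]*">(?<director>.*?)<\/a>/i', $html, $named);

echo count($two), "\n";         // 3: the full match plus two groups
echo $two[2], "\n";             // Darren Aronofsky
echo $one[1], "\n";             // Darren Aronofsky
echo $named['director'], "\n";  // Darren Aronofsky
```

Index 0 always holds the full match, so a pattern with N groups yields up to N + 1 entries.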
Yeah, it works fine in the example. I see that my problem is with the full code, when I convert the URL content to a string:
$url = "http://www.imdb.com/title/tt1125849/";
function get_imdb($url) {
    if (!($file = file_get_contents($url)))
        trigger_error('Impossible to return imdb page', E_USER_ERROR);
    if (!preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', preg_replace('#(\r?\n)+#','',$file),$matches))
        trigger_error('Unable to parse IMDB response', E_USER_ERROR);
    return $matches[1];
}
$resultado = get_imdb($url);
echo $resultado;
Why doesn't it work in this case? Thanks koko!
You could also do this using DOM and XPath:
$html = file_get_contents('http://www.imdb.com/title/tt1125849/');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//div[@id='director-info']//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    echo $href->firstChild->data . '<br />';
    echo $href->getAttribute('href');
}
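For trying the DOM approach without hitting the network, here is a self-contained sketch that runs on an inline string. The function name `extract_directors` and the `$sample` markup are hypothetical, modelled on the snippet from the first post. It uses `libxml_use_internal_errors()` instead of `@` to keep parser warnings quiet, which is a swapped-in choice rather than what the post above does.

```php
<?php
// Hypothetical helper (not from the thread): returns the text of every link
// inside the div with id "director-info".
function extract_directors(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml warnings about sloppy real-world HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $names = [];
    foreach ($xpath->query("//div[@id='director-info']//a") as $link) {
        // textContent also covers links whose text is wrapped in child tags.
        $names[] = trim($link->textContent);
    }
    return $names;
}

// Sample markup, assumed to resemble the page fragment quoted in the thread.
$sample = '<html><body><div id="director-info" class="info">
<h5>Director:</h5>
<a href="/name/nm0004716/">Darren Aronofsky</a><br/>
</div></body></html>';

print_r(extract_directors($sample));
```

Because the XPath query selects by `id`, line breaks and indentation in the source do not matter here.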
Because the incoming data comes escaped and you have to run stripslashes() on it:
preg_match('/director:\<\/h5\>\<a href=\"([^\"]*)\"\>(.*)?\<\/a\>/i', stripslashes(preg_replace('#(\r?\n)+#','',$file)),$matches)
By the way, as JDevereux wrote, it's better to use DOM in this case.
Regards
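If the match still fails or captures too much on the full page, leftover indentation between the tags and the greedy `.*` may be worth checking as well. The following is a sketch of a more tolerant pattern, not a definitive fix. The function name `extract_director` is hypothetical, and the pattern assumes the markup quoted in the first post, so it will need adjusting if the live page differs.

```php
<?php
// Hypothetical helper: returns the director name, or null when nothing matches.
function extract_director(string $html): ?string
{
    // \s* tolerates line breaks and indentation between the tags,
    // (.*?) is lazy so the capture stops at the first </a>,
    // and the "s" modifier lets "." match line breaks inside the link.
    $pattern = '/director:\s*<\/h5>\s*<a\s+href="[^"]*"[^>]*>(.*?)<\/a>/is';

    if (!preg_match($pattern, $html, $matches)) {
        return null;
    }

    // Only one capture group, so the name is in $matches[1].
    return trim(strip_tags($matches[1]));
}

// Sample input with indentation and a later, unrelated link (assumed markup).
$page = "<div id=\"director-info\" class=\"info\">\r\n"
      . "  <h5>Director:</h5>\r\n"
      . "  <a href=\"/name/nm0004716/\">Darren Aronofsky</a><br/>\r\n"
      . "</div>\n<a href=\"/other\">Other link</a>";

echo extract_director($page), "\n";
```

With this pattern there is no need to strip line breaks first, and returning `$matches[1]` gives the name because the href is no longer captured.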