Hi folks, I was wondering if anyone knows of code that will extract the 'title' and 'description' meta tags from a URL? I've been googling it all day and every bit of code I try doesn't work — I'm probably doing it wrong lol. Any help would be great, thanks!
Not quite like that, but I guess that's along the lines of the code I'm looking for. This is more like what I want to achieve: http://www.iwebtool.com/metatags_extractor
Ya, it's the same thing, I just get a lot more (harder to get) data in mine. What you want to do is something like this:

    $delicious = file_get_contents("http://delicious.com/tag/news");

That will get you the source code. Then you need to do a pattern match (i.e. preg_match, preg_match_all, etc.), something like this:

    preg_match_all(
        '/<h4>.*?<a rel="nofollow" class="taggedlink" href="(.*?)" >(.*?)<\/a>.*?<span class="delNavCount">(.*?)<\/span>/s',
        $delicious,
        $posts,         // will contain the blog posts
        PREG_SET_ORDER  // formats data into an array of posts
    );

Then you put the data into an array to use later:

    foreach ($posts as $post) {
        $de1 = $post[0]; // full match
        $de2 = $post[1]; // link
        $de3 = $post[2]; // title
        $de4 = $post[3]; // score
    }

Now output:

    echo "All the data is: $de1";
    echo "The link will be: $de2";

etc. etc. That's basically what you want. In summary:
1. Get the file contents (the page source).
2. Find a pattern for what you need to extract (in your case something like <title>(.*?)<\/title> — note that <meta> tags have no closing </meta>; the description lives in the tag's content="" attribute).
3. Store the data in an array.
4. Output the data.
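To tie the four steps above to the original question, here's a minimal sketch for title and description specifically. The function name is made up for illustration; the regexes assume fairly ordinary markup (attribute order, quoting) and will miss unusual pages:

```php
<?php
// Minimal sketch: pull <title> and the description meta out of raw HTML.
// <meta> is self-closing, so the description is read from its content
// attribute rather than from a closing tag.
function extract_title_description($html) {
    $out = array('title' => null, 'description' => null);
    if (preg_match('/<title>(.*?)<\/title>/si', $html, $m)) {
        $out['title'] = trim($m[1]);
    }
    if (preg_match('/<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']/si', $html, $m)) {
        $out['description'] = trim($m[1]);
    }
    return $out;
}

// Usage (assuming allow_url_fopen is enabled on your host):
// $html = file_get_contents("http://example.com/");
// $meta = extract_title_description($html);
```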
Fantastic...thank you very much! Of all the guides and code I've searched through today, that's been the easiest to follow. I'll give it a go! Thanks again! +rep added!
I've come up against a little problem. I'm entering a URL via a form and passing it as a variable:

    $data = file_get_contents_curl($url);

I'm using the following code to extract the <title> from the URL:

    // retrieve page title
    function get_doc_title($url) {
        $data = file_get_contents_curl($url);
        $spl  = explode("<title>", $data);
        $spl2 = explode("</title>", $spl[1]);
        $ret  = trim($spl2[0]);
        if (strlen($ret) == 0) {
            return 0;
        } else {
            return $ret;
        }
    }

Now you only need to enter the domain, not the whole address, e.g. digitalpoint.com. But it's quite common for webmasters to redirect domain.com to the www.domain.com address, and when that happens the above code returns null or a 301 redirect page. Also, is the following line of code correct?

    $data = file_get_contents_curl("www.".$url, $url);

I want to extract the source of a given URL, but if the URL only uses the bare domain without www., some functions don't work. So is the above call OK to use? Does it handle both www. and bare domain.com URLs? I don't know if I'm explaining this correctly lol. EDIT: Just tested the above command and it doesn't work. Is there any way I can overcome this? Thanks!
Is there a reason you're doing a cURL over file_get_contents? Anyway, here's a sample cURL function I use — I need it to get the contents of Digg, since they block file_get_contents requests in their .htaccess:

    function download_pretending($url, $user_agent) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result = curl_exec($ch);
        curl_close($ch);
        return $result;
    }

    $digg = download_pretending("http://digg.com/", "MSIE");

After that, it's all a matter of using the same loop as I have above, so:

    preg_match_all(
        '/<div class="news-summary.*?<h3>.*?<a href="(.*?)".*?">(.*?)<.*?<div class="news-details">.*?href="(.*?)" class="tool comments">(.*?)<\/a>.*?<span class="tool user-info">.*?<\/span>.*?<strong id=".*?">(.*?)<\/strong>/s',
        $digg,
        $posts,         // will contain the blog posts
        PREG_SET_ORDER  // formats data into an array of posts
    );

    foreach ($posts as $post) {
        $d1 = $post[0]; // full match
        $d2 = $post[1]; // link
        $d3 = $post[2]; // title
        $d4 = $post[3]; // comment link
        $d5 = $post[4]; // comment count
        $d6 = $post[5]; // digg count
    }

    echo $d4; // or whatevs

Sorry, I don't have much more time atm to look in depth at your code; I just thought I could give you a sample of working cURL code and maybe you could compare against it.
About following redirects: try adding

    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

to the file_get_contents_curl() function — that allows cURL to go chase the white rabbit. Some hosts have an open_basedir restriction, in which case cURL won't follow redirects. If so, this might help: http://www.edmondscommerce.co.uk/bl...followlocation-and-open_basedir-or-safe-mode/ I haven't tested it, though. If you do, let me know if it works?

About meta tags: http://us2.php.net/get_meta_tags

    $metadata = get_meta_tags($url); // use: http://domain.com
    echo '<table width="100%">';
    print '<tr><td>Meta</td><td>Value</td></tr>';
    foreach ($metadata as $name => $value) {
        echo '<tr><td valign="top">'.$name.'</td><td>'.$value.'</td></tr>';
    }
    print '</table>';
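get_meta_tags() accepts any readable path or URL, so you can sanity-check it without a network call by pointing it at a local file — a quick sketch (the HTML and filenames here are just for demonstration):

```php
<?php
// Write a small HTML snippet to a temp file and let get_meta_tags()
// parse it; it reads meta tags up to </head> and lowercases the keys.
$html = '<html><head><title>t</title>'
      . '<meta name="Description" content="A demo page">'
      . '<meta name="keywords" content="php,meta">'
      . '</head><body></body></html>';
$tmp = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($tmp, $html);
$metadata = get_meta_tags($tmp);
unlink($tmp);
// $metadata['description'] and $metadata['keywords'] now hold the values.
```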
If you do not want to deal with preg_match_all, you can load the HTML into the DOMDocument class and navigate through it. Here is an example of how to do it:

    // create a new cURL resource pointing to the specified URL
    $cURL = curl_init($aValues['url']);
    // exclude the header from the output
    curl_setopt($cURL, CURLOPT_HEADER, false);
    // return the transfer as the string return value of curl_exec()
    // instead of outputting it directly
    curl_setopt($cURL, CURLOPT_RETURNTRANSFER, true);
    // set the request timeout in seconds
    curl_setopt($cURL, CURLOPT_TIMEOUT, 60);
    // go after redirected pages
    curl_setopt($cURL, CURLOPT_FOLLOWLOCATION, true);
    // grab the URL and assign the result as a string to a variable
    $reply_page = curl_exec($cURL);
    // close the cURL resource and free up system resources
    curl_close($cURL);

    if (strlen($reply_page) == 0) {
        $eMsg .= 'Website unavailable.<br />';
        $isError = true;
    } else {
        $pageDOM = new DOMDocument();
        @$pageDOM->loadHTML($reply_page);
        // title
        $title_elements = $pageDOM->getElementsByTagName('title');
        if ($title_elements->length <> 0) {
            $aValues['title'] = $title_elements->item(0)->nodeValue;
        }
        $meta_elements = $pageDOM->getElementsByTagName('meta');
        foreach ($meta_elements as $meta_element) {
            if (strtolower($meta_element->getAttribute('name')) == 'description') {
                $aValues['description'] = $meta_element->getAttribute('content');
            }
            if (strtolower($meta_element->getAttribute('name')) == 'keywords') {
                $aValues['keywords'] = $meta_element->getAttribute('content');
            }
        }
    }

The full script can be found at forkaya.com/scripts/url-fetch.php
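The DOM half of the code above can be tried on its own, without cURL or a live site, by feeding it a string — a self-contained sketch (function name is made up):

```php
<?php
// Same DOMDocument technique as above, isolated into a function that
// takes raw HTML and returns title/description/keywords where present.
function dom_extract($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences warnings on imperfect markup
    $out = array();
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $out['title'] = $titles->item(0)->nodeValue;
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $out[$name] = $meta->getAttribute('content');
        }
    }
    return $out;
}
```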
Thanks, I was looking for that one. If I retrieve anchors with DOM, how do I access tag attributes, like rel='nofollow'?
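For what it's worth, anchors work the same way as the meta tags in the post above — any attribute is read with getAttribute(). A minimal sketch with made-up markup:

```php
<?php
// Grab <a> elements from a snippet and read their attributes.
$doc = new DOMDocument();
@$doc->loadHTML('<p><a href="/x" rel="nofollow">link</a></p>');
$href = '';
$rel  = '';
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    $rel  = $a->getAttribute('rel');
    $text = $a->nodeValue; // the anchor text itself
}
```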