Hi folks, I was wondering if anyone knows of code that will extract the 'title' and 'description' meta tags from a URL? I've been googling it all day and every bit of code I try doesn't work — I'm probably doing it wrong lol. Any help would be great, thanks!
Not quite like that, but I guess that's along the lines of the code I'm looking for. This is more like what I want to achieve: http://www.iwebtool.com/metatags_extractor
Ya, it's the same thing, I just get a lot more (harder to get) data in mine. What you want to do is something like this:

    $delicious = file_get_contents("http://delicious.com/tag/news");

That will get you the source code. Then you need to do a pattern match (i.e. preg_match, preg_match_all, etc.), something like this:

    preg_match_all(
        '/<h4>.*?<a rel="nofollow" class="taggedlink" href="(.*?)" >(.*?)<\/a>.*?<span class="delNavCount">(.*?)<\/span>/s',
        $delicious,
        $posts,         // will contain the blog posts
        PREG_SET_ORDER  // formats data into an array of posts
    );

Then you put the data into an array to use later:

    foreach ($posts as $post) {
        $de1 = $post[0]; // full match
        $de2 = $post[1]; // link
        $de3 = $post[2]; // title
        $de4 = $post[3]; // score
    }

Now output:

    echo "All the data is: $de1";
    echo "The link will be: $de2";

etc. etc. That's basically what you want. In summary:
1. Get the file contents (the page source).
2. Find a pattern for what you need to extract (in your case something like <title>(.*?)<\/title> — note that <meta> tags have no closing </meta>; the description lives in the tag's content="" attribute).
3. Store the data in an array.
4. Output the data.
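To tie the four steps above to the original question, here's a minimal sketch for title and description specifically. The function name is made up for illustration; the regexes assume fairly ordinary markup (attribute order, quoting) and will miss unusual pages:

```php
<?php
// Minimal sketch: pull <title> and the description meta out of raw HTML.
// <meta> is self-closing, so the description is read from its content
// attribute rather than from a closing tag.
function extract_title_description($html) {
    $out = array('title' => null, 'description' => null);
    if (preg_match('/<title>(.*?)<\/title>/si', $html, $m)) {
        $out['title'] = trim($m[1]);
    }
    if (preg_match('/<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']/si', $html, $m)) {
        $out['description'] = trim($m[1]);
    }
    return $out;
}

// Usage (assuming allow_url_fopen is enabled on your host):
// $html = file_get_contents("http://example.com/");
// $meta = extract_title_description($html);
```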
Fantastic...thank you very much! Of all the guides and code I've searched through today, that's been the easiest to follow. I'll give it a go! Thanks again! +rep added!
I've come up against a little problem. I'm entering a URL via a form and passing it as a variable:

    $data = file_get_contents_curl($url);

I'm using the following code to extract the <title> from the URL:

    // retrieve page title
    function get_doc_title($url) {
        $data = file_get_contents_curl($url);
        $spl  = explode("<title>", $data);
        $spl2 = explode("</title>", $spl[1]);
        $ret  = trim($spl2[0]);
        if (strlen($ret) == 0) {
            return 0;
        } else {
            return $ret;
        }
    }

Now you only need to enter the domain, not the whole address, e.g. digitalpoint.com. But it's quite common for webmasters to redirect domain.com to the www.domain.com address, and when that happens the above code returns null or a 301 redirect page. Also, is the following line of code correct?

    $data = file_get_contents_curl("www.".$url, $url);

I want to extract the source of a given URL, but if the URL only uses the bare domain without www., some functions don't work. So is the above call OK to use? Does it handle both www. and bare domain.com URLs? I don't know if I'm explaining this correctly lol. EDIT: Just tested the above command and it doesn't work. Is there any way I can overcome this? Thanks!
Is there a reason you're doing a cURL over file_get_contents? Anyway, here's a sample cURL function I use — I need it to get the contents of Digg, since they block file_get_contents requests in their .htaccess:

    function download_pretending($url, $user_agent) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result = curl_exec($ch);
        curl_close($ch);
        return $result;
    }

    $digg = download_pretending("http://digg.com/", "MSIE");

After that, it's all a matter of using the same loop as I have above, so:

    preg_match_all(
        '/<div class="news-summary.*?<h3>.*?<a href="(.*?)".*?">(.*?)<.*?<div class="news-details">.*?href="(.*?)" class="tool comments">(.*?)<\/a>.*?<span class="tool user-info">.*?<\/span>.*?<strong id=".*?">(.*?)<\/strong>/s',
        $digg,
        $posts,         // will contain the blog posts
        PREG_SET_ORDER  // formats data into an array of posts
    );

    foreach ($posts as $post) {
        $d1 = $post[0]; // full match
        $d2 = $post[1]; // link
        $d3 = $post[2]; // title
        $d4 = $post[3]; // comment link
        $d5 = $post[4]; // comment count
        $d6 = $post[5]; // digg count
    }

    echo $d4; // or whatevs

Sorry, I don't have much more time atm to look in depth at your code; I just thought I could give you a sample of working cURL code and maybe you could compare against it.
About following redirects: try adding

    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

to the file_get_contents_curl() function — that allows cURL to go chase the white rabbit. Some hosts have an open_basedir restriction, in which case cURL won't follow redirects. If so, this might help: http://www.edmondscommerce.co.uk/bl...followlocation-and-open_basedir-or-safe-mode/ I haven't tested it, though. If you do, let me know if it works?

About meta tags: http://us2.php.net/get_meta_tags

    $metadata = get_meta_tags($url); // use: http://domain.com
    echo '<table width="100%">';
    print '<tr><td>Meta</td><td>Value</td></tr>';
    foreach ($metadata as $name => $value) {
        echo '<tr><td valign="top">'.$name.'</td><td>'.$value.'</td></tr>';
    }
    print '</table>';
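get_meta_tags() accepts any readable path or URL, so you can sanity-check it without a network call by pointing it at a local file — a quick sketch (the HTML and filenames here are just for demonstration):

```php
<?php
// Write a small HTML snippet to a temp file and let get_meta_tags()
// parse it; it reads meta tags up to </head> and lowercases the keys.
$html = '<html><head><title>t</title>'
      . '<meta name="Description" content="A demo page">'
      . '<meta name="keywords" content="php,meta">'
      . '</head><body></body></html>';
$tmp = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($tmp, $html);
$metadata = get_meta_tags($tmp);
unlink($tmp);
// $metadata['description'] and $metadata['keywords'] now hold the values.
```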
If you do not want to deal with preg_match_all, you can load the HTML into the DOMDocument class and navigate through it. Here is an example of how to do it:

    // create a new cURL resource pointing to the specified URL
    $cURL = curl_init($aValues['url']);
    // exclude the header from the output
    curl_setopt($cURL, CURLOPT_HEADER, false);
    // return the transfer as the string return value of curl_exec()
    // instead of outputting it directly
    curl_setopt($cURL, CURLOPT_RETURNTRANSFER, true);
    // set the request timeout in seconds
    curl_setopt($cURL, CURLOPT_TIMEOUT, 60);
    // go after redirected pages
    curl_setopt($cURL, CURLOPT_FOLLOWLOCATION, true);
    // grab the URL and assign the result as a string to a variable
    $reply_page = curl_exec($cURL);
    // close the cURL resource and free up system resources
    curl_close($cURL);

    if (strlen($reply_page) == 0) {
        $eMsg .= 'Website unavailable.<br />';
        $isError = true;
    } else {
        $pageDOM = new DOMDocument();
        @$pageDOM->loadHTML($reply_page);
        // title
        $title_elements = $pageDOM->getElementsByTagName('title');
        if ($title_elements->length <> 0) {
            $aValues['title'] = $title_elements->item(0)->nodeValue;
        }
        $meta_elements = $pageDOM->getElementsByTagName('meta');
        foreach ($meta_elements as $meta_element) {
            if (strtolower($meta_element->getAttribute('name')) == 'description') {
                $aValues['description'] = $meta_element->getAttribute('content');
            }
            if (strtolower($meta_element->getAttribute('name')) == 'keywords') {
                $aValues['keywords'] = $meta_element->getAttribute('content');
            }
        }
    }

The full script can be found at forkaya.com/scripts/url-fetch.php
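The DOM half of the code above can be tried on its own, without cURL or a live site, by feeding it a string — a self-contained sketch (function name is made up):

```php
<?php
// Same DOMDocument technique as above, isolated into a function that
// takes raw HTML and returns title/description/keywords where present.
function dom_extract($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences warnings on imperfect markup
    $out = array();
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $out['title'] = $titles->item(0)->nodeValue;
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $out[$name] = $meta->getAttribute('content');
        }
    }
    return $out;
}
```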
Thanks, I was looking for that one. If I retrieve anchors with DOM, how do I access tag attributes, like rel='nofollow'?
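For what it's worth, anchors work the same way as the meta tags in the post above — any attribute is read with getAttribute(). A minimal sketch with made-up markup:

```php
<?php
// Grab <a> elements from a snippet and read their attributes.
$doc = new DOMDocument();
@$doc->loadHTML('<p><a href="/x" rel="nofollow">link</a></p>');
$href = '';
$rel  = '';
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    $rel  = $a->getAttribute('rel');
    $text = $a->nodeValue; // the anchor text itself
}
```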