How to extract the TITLE and DESCRIPTION of a URL

Discussion in 'HTML & Website Design' started by rakibtg, Jun 21, 2011.

  1. #1
    Hello,
    Here is some code that correctly extracts the TITLE and DESCRIPTION of the page it is embedded in.
        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
        <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
        <head>
        <title>My title</title>
        <meta http-equiv="content-type" content="text/html;charset=utf-8" />
        <meta name="description" content="A description of everything" />
        <meta name="keywords" content="what is this, googl , bd, in" />
        <style type="text/css">
        #Box{border:1px solid #000044;text-align:center;background-color : #fff;}
        #green {color:green }
        </style>
        </head>
        <body><h1>See below</h1><br />
        <div id="Box">
        <script type="text/javascript">
        //<![CDATA[
        var a = document.title;          // title of the current page
        var b = document.location.href;  // URL of the current page
        var c = document.getElementsByTagName('meta');
        var description = '';            // stays empty if the page has no description meta tag
        for (var x = 0, y = c.length; x < y; x++) {
        if (c[x].name.toLowerCase() == "description") {
        description = c[x].content;
        }
        }
        document.write('<a href="' + b + '"> ' + a + '<\/a> <br /> ' + description + '<br /> ');
        document.write('<span id="green"> ' + b + '<\/span> ');
        //]]>
        </script>
        </div>
        </body>
        </html> 
    Code (markup):
    What if I want to use the code above on my website to extract the TITLE and DESCRIPTION of a given URL? What changes do I need to make to this code?
    Hoping for a great response!
    Thank you
     
    rakibtg, Jun 21, 2011 IP
  2. badmas

    #2
    May we know how you are processing the new URL?

    Ajax is a great way to do it.
     
    badmas, Jun 21, 2011 IP
  3. rakibtg

    #3
    it is $attn
     
    rakibtg, Jun 21, 2011 IP
  4. Limetreeonline

    #4
    I hope it is okay to paste someone else's code here. I found this on the forkaya site. It fetches the title, description, and keywords from a URL you enter:

    <!--
    <?php
    /*
     * URL Fetch Script
     * 
     * This script fetches/extracts Title, Description, and Keywords from webpages
     * using specified URL
     * 
     * Provided by www.forkaya.com
     * 
     */
    
    
    /*
     * INITIALIZATION SECTION ***************************************************************************************************************************
     */
    
    	$isError = false;
    	$submitted = false;
    	$sourceVisible = false;
    	$sourceText_Visible = 'View Source Code';
    	$sourceText_NotVisible = 'Hide Source Code';
    	$eMsg = '';
    	$aValues = array(
    				'url'=>'',
    				'title'=>'',
    				'description'=>'',
    				'keywords'=>''
    				);
    	//script support charsets needed for encoding purposes
    	//add others if needed; it will require custom coding; look for 'charset custom' comments below
    	//keep charsets lowercase			
    	$aCharsets = array(
    				'utf-8', //Unicode
    				'iso-8859-1' //Western Europe
    				//charset custom: add other charsets as needed
    				//'windows-1258' //Vietnamese
    				);
    
    /*
     * FUNCTIONS SECTION ***************************************************************************************************************************
     */
    	//this function will determine the website's charset
    	function get_charset($aCS,$website) {
    		
    		$result = '';
    		$website = strtolower($website); 
    		
    		//check the http header first
    		$pos = strpos($website,'<html');
    		if ($pos) {
    		    $wsHeader = substr($website,0,$pos);
    			//loop through array of charsets
    			foreach ($aCS as $val) {
    			
    				if (strpos($wsHeader,$val) > 0) {
    					$result = $val;
    					break;
    				}
    			}
    		}
    
    		if (empty($result)) {
    			
    			//supported charset was not found in the http header
    
    			$wsContentType = '';
    			
    			$wsDOM = new DOMDocument();
    			@$wsDOM->loadHTML($website);
    			
    			$meta_elements = $wsDOM->getElementsByTagName('meta');
    			foreach ($meta_elements as $meta_element) {
    				if (strtolower($meta_element->getAttribute('http-equiv')) == 'content-type') {
    			    	$wsContentType = strtolower($meta_element->getAttribute('content'));
    				}
    			}
    			
    			if ($wsContentType === '') {
    				//return empty
    			} else {
    				// look for specific charsets
    				
    				//loop through array of charsets
    				foreach ($aCS as $val) {
    				
    					if (strpos($wsContentType,$val) > 0) {
    						$result = $val;
    						break;
    					}
    				}
    			}
    		}
    		
    		return $result;
    	}
    	
    /*
     * VALIDATION AND ACTION SECTION ********************************************************************************************************************
     */
    	
    	if (isset($_POST['submit'])) {
    
    		$submitted = true;
    		$aValues['url'] = $_POST['url'];
    		$aValues['title'] = 'No title';
    		$aValues['description'] = 'No description';
    		$aValues['keywords'] = 'No keywords';
    		
    		if (strlen($_POST['url']) == 0) {
    			$eMsg .= 'URL cannot be blank.<br />';
    			$isError = true;
    		}
    
    		if(!$isError) {
    
    			//create a new cURL resource pointing to specified url
    			$cURL = curl_init($aValues['url']);
    			//include the header in the output. 
    			curl_setopt($cURL,CURLOPT_HEADER,true);
    			//return the transfer as a string of the return value of curl_exec()
    			//instead of outputting it out directly. 
    			curl_setopt($cURL,CURLOPT_RETURNTRANSFER,true);
    			//set the request timeout in sec.
    			curl_setopt($cURL,CURLOPT_TIMEOUT,60);
    			//go after redirected pages
    			curl_setopt($cURL, CURLOPT_FOLLOWLOCATION, true);
    			
    			//grab URL and assign it as string to variable
    			$reply_page = curl_exec($cURL);
    
    			//echo('<--'.$reply_page.'-->');
    			
    			//close cURL resource, and free up system resources
    			curl_close($cURL);
    			
    			if (strlen($reply_page) == 0) {
    				$eMsg .= 'Website unavailable.<br />';
    				$isError = true;
    			} else {
    				
    				//determine the website's charset
    				$wbCharset = get_charset($aCharsets,$reply_page);
    				
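    				//we do not need header anymore; look for '<html' case-insensitively
    				//and keep the full reply if no <html tag is found (strstr would return false)
    				$htmlStart = stripos($reply_page,'<html');
    				if ($htmlStart !== false) {
    					$reply_page = substr($reply_page,$htmlStart);
    				}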
    				
    				//we need to convert to utf-8 because DOMDocument expects it
    				switch ($wbCharset) {
    					case '':
    						//do nothing
    						break;
    						
    					case 'utf-8':
    						
    						//for the purpose of this script, we can replace 'iso-8859-1' strings with 'utf-8' (if there are any) in the whole website
    						$reply_page = str_ireplace('iso-8859-1','utf-8',$reply_page);
    						break;
    						
    					case 'iso-8859-1':
    
    						//for the purpose of this script, we can replace 'iso-8859-1' with 'utf-8' in the whole website
    						$reply_page = str_ireplace('iso-8859-1','utf-8',$reply_page);
    						
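    						//encode the website into utf-8
    						//(utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding() needs the mbstring extension)
    						$reply_page = mb_convert_encoding($reply_page,'UTF-8','ISO-8859-1');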
    						break;
    						
    					//charset custom: add logic for other charsets as needed
    					//case 'windows-1258': //Vietnamese
    					//	$reply_page = str_ireplace('windows-1258','utf-8',$reply_page);
    					//	write or find a code to encode the charset to utf-8
    					//	break;
    				}
    
    				//for the purpose of this script, 
    				//we can add <meta http-equiv=Content-Type content="text/html; charset=utf-8"> tag
    				//right after <head> tag to make DOM 'happy'
    				$reply_page = str_ireplace(
    					'<head>',
    					'<head><meta http-equiv=Content-Type content="text/html; charset=utf-8">',
    					$reply_page);
    				
    				$pageDOM = new DOMDocument();
    				@$pageDOM->loadHTML($reply_page);
    				
    				//Title
    				$title_elements = $pageDOM->getElementsByTagName('title');
    				if ($title_elements->length <> 0) {
    					$aValues['title'] = $title_elements->item(0)->nodeValue;
    				}
    				
    				$meta_elements = $pageDOM->getElementsByTagName('meta');
    				foreach ($meta_elements as $meta_element) {
    					if (strtolower($meta_element->getAttribute('name')) == 'description') {
    				    	$aValues['description'] = $meta_element->getAttribute('content');
    					}
    					if (strtolower($meta_element->getAttribute('name')) == 'keywords') {
    				    	$aValues['keywords'] = $meta_element->getAttribute('content');
    					}
    				}
    			}
    		}
    		
    	}
    	
    	if (isset($_GET['source'])) {
    		if ($_GET['source'] == 1) {
    			$sourceStr = file_get_contents('url-fetch-source.php');
    			$sourceVisible = true;
    			$sourceText = $sourceText_NotVisible;
    			$sourceValue = 0;
    		} else {
    			$sourceVisible = false;
    			$sourceText = $sourceText_Visible;
    			$sourceValue = 1;
    		}
    	} else {
    		$sourceText = $sourceText_Visible;
    		$sourceValue = 1;
    	}
    	
    	header('Content-Type: text/html; charset=utf-8');
    	
    /*
     * DISPLAY SECTION **********************************************************************************************************************************
     */
    ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    	<head>
    		<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    		<meta name="description" content="This PHP script extracts Title, Description, and Keywords from specified URL"/>
    		<meta name="keywords" content="php scripting, php, extract, fetch, meta"/>
    		<meta name="author" content="forkaya" />
    		<link rel="stylesheet" href="../style.css" type="text/css">
    		<title>Forkaya - PHP Scripts - URL Fetch - Extract Title, Description, and Keywords from URL</title>
    	</head>
    	<body>
    		<form action="" method="post">
    			<table 	align="left" class="bb">
    				<tr>
    					<td colspan="2" height="30" align="center"  class="aa"><h3><a href="..">Forkaya</a> - <a href=".">PHP Scripts</a> - <a href="./url-fetch.php">URL Fetch</a></h3></td>
    				</tr>
    				<tr>
    					<td colspan="2" height="30" align="left">This script fetches/extracts Title, Description, and Keywords from webpages using specified URL</td>
    				</tr>
    				<tr>
    					<td colspan="2" height="30"></td>
    				</tr>
    <?php
    
    	if($isError) { 
    		echo('
    				<tr>
    					<td colspan="2" align="left" class="cc">'.$eMsg.'</td>
    				</tr>
    		'); 
    	}
    
    ?>
    				<tr>
    					<td align="left">Enter URL:</td>
    					<td align="left"><input type="text" name="url" maxlength="256" size="56" value="<?php echo(htmlspecialchars($aValues['url'],ENT_QUOTES,'UTF-8'));?>"/></td>
    				</tr>
    				<tr>
    					<td align="left"><input type="submit" name="submit" value="Submit"/></td>
    					<td align="left"><a href="url-fetch.php?source=<?php echo($sourceValue);?>" class="ff"><?php echo($sourceText); ?></a></td>
    				</tr>
    				<tr>
    					<td colspan="2" align="left"></td>
    				</tr>
    <?php
    
    	if($submitted and !$isError) { 
    		echo('
    				<tr>
    					<td align="left" valign="top" class="aa">Title: </td>
    					<td align="left">'.htmlspecialchars($aValues['title'],ENT_QUOTES,'UTF-8').'</td>
    				</tr>
    				<tr>
    					<td align="left" valign="top" class="aa">Description:</td>
    					<td>'.htmlspecialchars($aValues['description'],ENT_QUOTES,'UTF-8').'</td>
    				</tr>
    				<tr>
    					<td align="left" valign="top" class="aa">Keywords:</td>
    					<td align="left">'.htmlspecialchars($aValues['keywords'],ENT_QUOTES,'UTF-8').'</td>
    				</tr>
    		'); 
    	}
    
    	if($sourceVisible) { 
    		echo('
    				<tr>
    					<td align="left" colspan="2"><textarea rows="174" cols="210" readonly="readonly" class="ee">'.$sourceStr.'
    HTML:
     
    Limetreeonline, Jun 22, 2011 IP
  5. rakibtg

    #5
    Is this code complete?
     
    Last edited: Jun 23, 2011
    rakibtg, Jun 23, 2011 IP
  6. BRUm

    #6
    That's a lot of code. Unless you really need the client's browser to extract these things, just use PHP's regex functions and simply return the result.

    
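    <?php
        $html = file_get_contents("http://awebsite.com");
        // Capture the text between <title> and </title> (lazy, case-insensitive, may span lines)
        preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $title);
        // The description is the content attribute of <meta name="description">, not a <description> tag;
        // this pattern assumes name comes before content and both are double-quoted
        preg_match('/<meta\s+name="description"\s+content="([^"]*)"/i', $html, $desc);
        print isset($title[1]) ? $title[1] : ''; // empty string if no title was found
        print isset($desc[1]) ? $desc[1] : '';   // empty string if no description was found
    ?>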
    
    PHP:
     
    BRUm, Jun 25, 2011 IP
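    The regex idea from post #6 can also be sketched in JavaScript, the language of the first post. Two points matter in any language: the title is the text inside the `<title>` element, while the description is the `content` attribute of a `<meta name="description">` tag. The function names `extractTitle` and `extractDescription` are made up for this illustration, and both assume the page's HTML has already been fetched into a string. Regexes are fragile on unusual markup, so treat this as a sketch rather than a robust parser.

```javascript
// Hypothetical helpers for illustration; both expect the page's HTML as a string.

// Returns the trimmed text between <title> and </title>, or null if there is none.
function extractTitle(html) {
  // [\s\S]*? matches lazily across line breaks; the i flag ignores tag case.
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Returns the content attribute of <meta name="description">, or null.
function extractDescription(html) {
  // Collect every <meta ...> tag first, then inspect each tag's attributes,
  // so the order of name and content inside the tag does not matter.
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const name = /\sname\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (!name || name[1].toLowerCase() !== "description") continue;
    // Accept either double-quoted or single-quoted content values.
    const content = /\scontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
    if (content) return content[1] !== undefined ? content[1] : content[2];
  }
  return null;
}
```

    Both functions return null when nothing is found, so the caller can fall back to a default such as 'No title' instead of printing an empty value.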
  7. charlessconcepts

    #7
    I have used this as well.

    $metatagarray = get_meta_tags( $URL );
    $keywords = $metatagarray[ "keywords" ];
    $description = $metatagarray[ "description" ];
    $author = $metatagarray[ "author" ];

    But I have a small issue: I have been trying to extract the first image within the content/body section, with no success.

    preg_match("/<img[^>]+>/i", $html, $image);

    I want the image URL itself, so I can put it into the page as an <img src=
    Any ideas?

    Thanks
    Charles.
     
    charlessconcepts, Aug 6, 2011 IP
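    One way to get the image URL rather than the whole tag is to add a capture group around the value of the `src` attribute. The sketch below does this in JavaScript, the language of the first post. The function name `extractFirstImageSrc` and its optional `baseUrl` parameter are assumptions made for this illustration. It assumes the `src` value is quoted, and it returns the first `<img>` that actually has a `src`.

```javascript
// Hypothetical helper for illustration: returns the src of the first <img> tag
// that has a quoted src attribute, or null if there is no such tag.
function extractFirstImageSrc(html, baseUrl) {
  // \ssrc requires whitespace before "src", so attributes such as data-src are skipped.
  // The two alternatives capture double-quoted and single-quoted values.
  const match = /<img\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(html);
  if (!match) return null;
  const src = match[1] !== undefined ? match[1] : match[2];
  // When the page URL is supplied, resolve relative paths against it.
  return baseUrl ? new URL(src, baseUrl).href : src;
}

// Example use: build a new tag from the extracted URL.
const exampleSrc = extractFirstImageSrc('<p><img alt="logo" src="/img/logo.png"></p>');
const exampleTag = exampleSrc ? '<img src="' + exampleSrc + '" />' : '';
```

    The pattern uses nothing specific to JavaScript, so it should carry over to PHP's preg_match with little change, but that is untested here.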