Find links in a webpage

Discussion in 'PHP' started by ved2210, Nov 1, 2009.

  1. #1
    <?php

    function only_links($var)
    {

    echo $var;
    $str=$var ;
    $str=preg_match("/href='([^']*)'/", $str, $regs);
    $new_str= $regs[1];
    $var=substr($new_str,7,strlen($new_str));
    echo "$var";
    return($var);

    }


    $fh=file("index.html");
    array_filter($fh,"only_links");
    print_r($fh);


    ?>


    Index.html is a webpage that contains all kinds of data, including page links. My main task is to find all the links on that page. With this code I am trying to collect them all in an array. Is this a good way to do the task? Please help me with it. My code is not even running, and I don't know why. I would appreciate your help. Thanks for reading this thread.
    Cheers. :)
     
    ved2210, Nov 1, 2009 IP
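For reference: in the snippet above, `preg_match()`'s integer return value overwrites `$str`, and `array_filter()` is the wrong tool because its callback must return a boolean, not the matched string. A minimal corrected sketch (assuming the same single-quoted `href` format and a local `index.html`):

```php
<?php
// Extract one href='...' URL per line (single-quoted attributes assumed).
function only_links($line)
{
    // preg_match() returns 1 or 0; the captured URL lands in $regs[1].
    if (preg_match("/href='([^']*)'/", $line, $regs)) {
        return $regs[1];
    }
    return null; // no link on this line
}

if (file_exists("index.html")) {
    $lines = file("index.html");              // one array element per line
    $links = array_map("only_links", $lines); // array_map() collects return values
    $links = array_filter($links);            // array_filter() then drops the nulls
    print_r(array_values($links));
}
```

`array_map()` does the extraction and `array_filter()` (with no callback) removes the empty results, which is the split of responsibilities the original code was reaching for.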
  2. DansTuts

    DansTuts Guest

    #2
    <?php 
    	
    	// Coded by Daniel Clarke (Danstuts) 
    	// Use Curl to open up the website - with a timeout of 60 to avoid wasting resources
    
    	function opens($url) {
    		$ch = curl_init();
    		curl_setopt($ch, CURLOPT_HEADER, 0);
    		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); //Set curl to return the data instead of printing it to the browser.
    		curl_setopt($ch, CURLOPT_URL, $url);
    		curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com');
    		curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
    		curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
    		$data = curl_exec($ch);
    		curl_close($ch);
    	 
    		return $data;
    	}
    	
    	// Set the page to be opened.
    		
    	$url = "http://news.bbc.co.uk/1/hi/world/asia-pacific/8336564.stm";
    	
    	// use the curl function to grab the page contents.
    	
        $webpage = opens($url);
    	
    	// Use some regex to grab all the web URLs from the page we've opened.
    	
        preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/", $webpage, $urlmatch);
            
    	// $urlmatch[1] now holds every URL captured by the first group.
    	// (Passing &$urlmatch by reference at call time is deprecated, and a fatal error in PHP 5.4+.)
    	
        $urls = $urlmatch[1];
    	
    	// For testing echo each item of the array (the urls in this case) 
    
        foreach($urls as $var)
        {    
            echo($var."<br>");
        }
    	
    ?>
     
    DansTuts, Nov 1, 2009 IP
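Regex link-scraping like the above is fragile: it breaks on double-quoted attributes, extra whitespace, and anchors split across lines. As an alternative sketch (not from the thread), PHP's built-in DOMDocument extension can pull every `href` from a real HTML parse:

```php
<?php
// Sketch: extract all <a href> values with DOMDocument instead of regex.
function extract_links($html)
{
    $doc = new DOMDocument();
    // Suppress warnings from real-world (non-well-formed) HTML.
    @$doc->loadHTML($html);

    $urls = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $urls[] = $anchor->getAttribute('href');
        }
    }
    return $urls;
}

$html = '<a href="http://example.com/a">A</a> <a href=\'/b.html\'>B</a>';
print_r(extract_links($html));
```

This handles single quotes, double quotes, and attribute ordering uniformly, at the cost of loading the whole document into a DOM tree.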
  3. ved2210

    ved2210 Peon

    #3
    Thank you so much, sir!
    I am just going to play with it.
    I will let you know if I need further help.

    Thanks again,
    you are the best!

    Cheers.
     
    ved2210, Nov 1, 2009 IP
  4. ved2210

    ved2210 Peon

    #4
    I get "Call to undefined function curl_init()".
    What should I do?

    How should I use your function in my code?

    Thanks
     
    ved2210, Nov 1, 2009 IP
  5. DansTuts

    DansTuts Guest

    #5
    Your host does not have cURL installed. Try this instead:

    
    <?php 
    	
    	// Coded by Daniel Clarke (Danstuts) 
    	// No cURL on this host, so fetch the page with file_get_contents() instead.
    
    	function opens($url) {
    		return file_get_contents($url); 
    	}
    	
    	// Set the page to be opened.
    		
    	$url = "http://news.bbc.co.uk/1/hi/world/asia-pacific/8336564.stm";
    	
    	// Use file_get_contents() (via opens) to grab the page contents.
    	
        $webpage = opens($url);
    	
    	// Use some regex to grab all the web URLs from the page we've opened.
    	
        preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/", $webpage, $urlmatch);
            
    	// $urlmatch[1] now holds every URL captured by the first group.
    	// (Passing &$urlmatch by reference at call time is deprecated, and a fatal error in PHP 5.4+.)
    	
        $urls = $urlmatch[1];
    	
    	// For testing echo each item of the array (the urls in this case) 
    
        foreach($urls as $var)
        {    
            echo($var."<br>");
        }
    	
    ?>
    
     
    DansTuts, Nov 1, 2009 IP
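Rather than maintaining two scripts, the "Call to undefined function curl_init()" error can be avoided with a single fetch function that checks for the extension at runtime. A sketch (the timeout value is illustrative):

```php
<?php
// Sketch: use cURL when the extension is loaded, file_get_contents() otherwise.
function opens($url)
{
    if (function_exists('curl_init')) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body as a string
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);       // don't hang forever
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }
    // Host has no cURL extension: fall back to the plain stream wrapper.
    return file_get_contents($url);
}
```

Note the fallback requires `allow_url_fopen` to be enabled in php.ini when fetching remote URLs.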
  6. ved2210

    ved2210 Peon

    #6
    Hello Sir,
    It worked!

    But I have one more question.
    I want only HTML pages, and I want to filter out all other files like .jpg, .pdf, and .doc.
    What should I do?

    Basically, I want to filter the results and keep only the HTML pages.


    Thanks,
    Ved
     
    ved2210, Nov 1, 2009 IP
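One way to answer that last question, as a sketch: filter the matched URLs by file extension, keeping only links that look like HTML pages (the extension list below is an assumption; adjust it to taste):

```php
<?php
// Sketch: keep only URLs that look like HTML pages.
function is_html_link($url)
{
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === null || $path === '' || substr($path, -1) === '/') {
        return true; // bare domains and directory-style URLs usually serve HTML
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // No extension at all, or a known HTML-ish one, passes the filter.
    return $ext === '' || in_array($ext, array('html', 'htm', 'php', 'stm'));
}

$urls = array(
    'http://example.com/page.html',
    'http://example.com/photo.jpg',
    'http://example.com/doc.pdf',
    'http://example.com/news/',
);
print_r(array_values(array_filter($urls, 'is_html_link')));
```

This plugs straight into the earlier scripts: pass `$urls` through `array_filter($urls, 'is_html_link')` after the `preg_match_all()` step. Checking by extension is a heuristic; the reliable test is fetching each URL and inspecting its `Content-Type` header.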