Find links in a webpage

Discussion in 'PHP' started by ved2210, Nov 1, 2009.

  1. #1
    <?php

    function only_links($var)
    {

    echo $var;
    $str=$var ;
    $str=preg_match("/href='([^']*)'/", $str, $regs);
    $new_str= $regs[1];
    $var=substr($new_str,7,strlen($new_str));
    echo "$var";
    return($var);

    }


    $fh=file("index.html");
    array_filter($fh,"only_links");
    print_r($fh);


    ?>


    Index.html is a webpage that contains all kinds of data, including page links. My main task is to find all the links on that page. With this code I am trying to collect them all in an array. Is this a good way to do the task? Please help me with it. My code is not even running, and I don't know why. I would appreciate your help. Thanks for reading this thread.
    Cheers. :)
     
    ved2210, Nov 1, 2009 IP
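For reference: in the snippet above, `preg_match()`'s integer return value overwrites `$str`, and `array_filter()` is the wrong tool because its callback must return a boolean, not the matched string. A minimal corrected sketch (assuming the same single-quoted `href` format and a local `index.html`):

```php
<?php
// Extract one href='...' URL per line (single-quoted attributes assumed).
function only_links($line)
{
    // preg_match() returns 1 or 0; the captured URL lands in $regs[1].
    if (preg_match("/href='([^']*)'/", $line, $regs)) {
        return $regs[1];
    }
    return null; // no link on this line
}

if (file_exists("index.html")) {
    $lines = file("index.html");              // one array element per line
    $links = array_map("only_links", $lines); // array_map() collects return values
    $links = array_filter($links);            // array_filter() then drops the nulls
    print_r(array_values($links));
}
```

`array_map()` does the extraction and `array_filter()` (with no callback) removes the empty results, which is the split of responsibilities the original code was reaching for.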
  2. DansTuts

    DansTuts Guest

    #2
    <?php 
    	
    	// Coded by Daniel Clarke (Danstuts) 
    	// Use Curl to open up the website - with a timeout of 60 to avoid wasting resources
    
    	function opens($url) {
    		$ch = curl_init();
    		curl_setopt($ch, CURLOPT_HEADER, 0);
    		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); //Set curl to return the data instead of printing it to the browser.
    		curl_setopt($ch, CURLOPT_URL, $url);
    		curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com');
    		curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
    		curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
    		$data = curl_exec($ch);
    		curl_close($ch);
    	 
    		return $data;
    	}
    	
    	// Set the page to be opened.
    		
    	$url = "http://news.bbc.co.uk/1/hi/world/asia-pacific/8336564.stm";
    	
    	// use the curl function to grab the page contents.
    	
        $webpage = opens($url);
    	
    	// Use some regex to grab all the web URLs from the page we've opened.
    	
        preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/", $webpage, $urlmatch);
            
    	// $urlmatch[1] now holds every URL captured by the first group.
    	// (Passing &$urlmatch by reference at call time is deprecated, and a fatal error in PHP 5.4+.)
    	
        $urls = $urlmatch[1];
    	
    	// For testing echo each item of the array (the urls in this case) 
    
        foreach($urls as $var)
        {    
            echo($var."<br>");
        }
    	
    ?>
     
    DansTuts, Nov 1, 2009 IP
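Regex link-scraping like the above is fragile: it breaks on double-quoted attributes, extra whitespace, and anchors split across lines. As an alternative sketch (not from the thread), PHP's built-in DOMDocument extension can pull every `href` from a real HTML parse:

```php
<?php
// Sketch: extract all <a href> values with DOMDocument instead of regex.
function extract_links($html)
{
    $doc = new DOMDocument();
    // Suppress warnings from real-world (non-well-formed) HTML.
    @$doc->loadHTML($html);

    $urls = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $urls[] = $anchor->getAttribute('href');
        }
    }
    return $urls;
}

$html = '<a href="http://example.com/a">A</a> <a href=\'/b.html\'>B</a>';
print_r(extract_links($html));
```

This handles single quotes, double quotes, and attribute ordering uniformly, at the cost of loading the whole document into a DOM tree.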
  3. ved2210

    ved2210 Peon

    #3
    Thank you so much, sir!
    I am just going to play with it.
    I will let you know if I need further help.

    Thanks again,
    you are the best!

    Cheers.
     
    ved2210, Nov 1, 2009 IP
  4. ved2210

    ved2210 Peon

    #4
    I get "Call to undefined function curl_init()".
    What should I do?

    How should I use your function in my code?

    Thanks
     
    ved2210, Nov 1, 2009 IP
  5. DansTuts

    DansTuts Guest

    #5
    Your host does not have cURL installed. Try this instead:

    
    <?php 
    	
    	// Coded by Daniel Clarke (Danstuts) 
    	// No cURL on this host, so fetch the page with file_get_contents() instead.
    
    	function opens($url) {
    		return file_get_contents($url); 
    	}
    	
    	// Set the page to be opened.
    		
    	$url = "http://news.bbc.co.uk/1/hi/world/asia-pacific/8336564.stm";
    	
    	// Use file_get_contents() (via opens) to grab the page contents.
    	
        $webpage = opens($url);
    	
    	// Use some regex to grab all the web URLs from the page we've opened.
    	
        preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>([^<]+|.*?)?<\/a>/", $webpage, $urlmatch);
            
    	// $urlmatch[1] now holds every URL captured by the first group.
    	// (Passing &$urlmatch by reference at call time is deprecated, and a fatal error in PHP 5.4+.)
    	
        $urls = $urlmatch[1];
    	
    	// For testing echo each item of the array (the urls in this case) 
    
        foreach($urls as $var)
        {    
            echo($var."<br>");
        }
    	
    ?>
    
     
    DansTuts, Nov 1, 2009 IP
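Rather than maintaining two scripts, the "Call to undefined function curl_init()" error can be avoided with a single fetch function that checks for the extension at runtime. A sketch (the timeout value is illustrative):

```php
<?php
// Sketch: use cURL when the extension is loaded, file_get_contents() otherwise.
function opens($url)
{
    if (function_exists('curl_init')) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body as a string
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);       // don't hang forever
        $data = curl_exec($ch);
        curl_close($ch);
        return $data;
    }
    // Host has no cURL extension: fall back to the plain stream wrapper.
    return file_get_contents($url);
}
```

Note the fallback requires `allow_url_fopen` to be enabled in php.ini when fetching remote URLs.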
  6. ved2210

    ved2210 Peon

    #6
    Hello Sir,
    It worked!

    But I have one more question.
    I want only HTML pages, and I want to filter out all other files like .jpg, .pdf, and .doc.
    What should I do?

    Basically, I want to filter the results and keep only the HTML pages.


    Thanks,
    Ved
     
    ved2210, Nov 1, 2009 IP
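One way to answer that last question, as a sketch: filter the matched URLs by file extension, keeping only links that look like HTML pages (the extension list below is an assumption; adjust it to taste):

```php
<?php
// Sketch: keep only URLs that look like HTML pages.
function is_html_link($url)
{
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === null || $path === '' || substr($path, -1) === '/') {
        return true; // bare domains and directory-style URLs usually serve HTML
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // No extension at all, or a known HTML-ish one, passes the filter.
    return $ext === '' || in_array($ext, array('html', 'htm', 'php', 'stm'));
}

$urls = array(
    'http://example.com/page.html',
    'http://example.com/photo.jpg',
    'http://example.com/doc.pdf',
    'http://example.com/news/',
);
print_r(array_values(array_filter($urls, 'is_html_link')));
```

This plugs straight into the earlier scripts: pass `$urls` through `array_filter($urls, 'is_html_link')` after the `preg_match_all()` step. Checking by extension is a heuristic; the reliable test is fetching each URL and inspecting its `Content-Type` header.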