How to collect all links from a web page?

Discussion in 'PHP' started by ramysarwat, Oct 25, 2009.

  1. #1
    How can I collect all links from a web page?

    I tried some code, but it gives me just the last part of the URL, not the full URL. So if the URL is http://www.example.com/test.html, it returns test.html.
     
    ramysarwat, Oct 25, 2009 IP
  2. mastermunj

    mastermunj Well-Known Member

    #2
    Can you give more details, with examples, of exactly what you need to achieve?
     
    mastermunj, Oct 25, 2009 IP
  3. ramysarwat

    ramysarwat Peon

    #3
    I want to collect all URLs from any web page.

    For example, when I type http://www.google.com I want to get all links from that page, but as full URLs. I tried this on other sites before using file_get_contents and got only the page's file name, like example.html, but I need the full URL.
     
    ramysarwat, Oct 25, 2009 IP
  4. jpatrick85

    jpatrick85 Member

    #4
    I think you can use file_get_contents, then use preg_match_all to find all links that fit the pattern "<a href=" (preg_match alone stops at the first match). If you want full URLs, prepend the known domain to the relative ones (or use another preg_match to pull the domain out of the page URL). If you give us some code, that might help too.
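
    A minimal sketch of that approach, assuming the HTML has already been fetched into a string (for example with file_get_contents). The function name extractLinks and the sample markup are made up for illustration, and the regular expression only handles quoted href values; it does not cover protocol-relative links (//host/path) or ../ segments.

```php
<?php
// Sketch: pull href values out of an HTML string with preg_match_all and
// prefix the relative ones with a known base. extractLinks() is a
// hypothetical helper name, not a PHP built-in.

function extractLinks(string $html, string $base): array
{
    // Capture the quoted href value of each <a ...> tag (single or double quotes).
    preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $matches);

    $links = [];
    foreach ($matches[2] as $href) {
        if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
            // Already has a scheme (http:, https:, mailto:, ...): keep as written.
            $links[] = $href;
        } else {
            // Relative link: glue it onto the base with exactly one slash between.
            $links[] = rtrim($base, '/') . '/' . ltrim($href, '/');
        }
    }
    return $links;
}

// Sample markup standing in for a downloaded page.
$html  = '<p><a href="test.html">Test</a> <a class="x" href=\'http://www.example.org/a\'>A</a></p>';
$links = extractLinks($html, 'http://www.example.com');
print_r($links);
```

    With a real page, $html would come from something like file_get_contents($pageUrl), and $base would be the scheme and host of that same page.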
     
    jpatrick85, Oct 25, 2009 IP
  5. AsHinE

    AsHinE Well-Known Member

    #5
    AsHinE, Oct 25, 2009 IP
  6. Brandon_R

    Brandon_R Peon

    #6
    Here is some code I found that might help.

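    // Note: the mysql_* functions belong to the old ext/mysql extension, which was
    // removed in PHP 7; on current PHP use mysqli or PDO with prepared statements.
    function storeLink($url,$gathered_from) {
    	// escape both values so a quote inside a URL cannot break the query
    	$url = mysql_real_escape_string($url);
    	$gathered_from = mysql_real_escape_string($gathered_from);
    	$query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    	mysql_query($query) or die('Error, insert query failed');
    }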
    
    $target_url = "http://www.merchantos.com/";
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    
    // make the cURL request to $target_url
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL,$target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html= curl_exec($ch);
    // curl_exec returns false on failure; an empty response body is not an error
    if ($html === false) {
    	echo "<br />cURL error number:" .curl_errno($ch);
    	echo "<br />cURL error:" . curl_error($ch);
    	exit;
    }
    
    // parse the html into a DOMDocument
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    // grab all the links on the page
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");
    
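    for ($i = 0; $i < $hrefs->length; $i++) {
    	$href = $hrefs->item($i);
    	// the href comes back exactly as written in the page, so relative links stay relative
    	$url = $href->getAttribute('href');
    	storeLink($url,$target_url);
    	// escape the URL before echoing it into HTML output
    	echo "<br />Link stored: " . htmlspecialchars($url);
    }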
    PHP:
    Source: http://www.merchantos.com/makebeta/php/scraping-links-with-php/
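
    The script above stores each href exactly as it appears in the page, so relative links such as test.html stay relative. One possible way to turn them into full URLs is sketched below. resolveUrl and collectLinks are hypothetical helper names, not part of PHP or of the original script, and the resolution logic is simplified: it ignores any <base> tag in the page and drops the base page's query string for fragment-only links.

```php
<?php
// Sketch: collect the links in an HTML string and resolve each one
// against the URL of the page it came from.

function resolveUrl(string $base, string $href): string
{
    // Already absolute (has a scheme such as http: or mailto:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (strpos($href, '//') === 0) {      // protocol-relative: //host/path
        return $scheme . ':' . $href;
    }
    if (strpos($href, '/') === 0) {       // root-relative: /path
        return $origin . $href;
    }

    // Split off any query string or fragment so it is kept verbatim.
    $cut    = strcspn($href, '?#');
    $rel    = substr($href, 0, $cut);
    $suffix = substr($href, $cut);
    if ($rel === '') {                    // "?q=1" or "#top": same page
        return $origin . $path . $suffix;
    }

    // Path-relative: start from the directory of the base path,
    // then fold away "." and ".." segments.
    $dir      = substr($path, 0, strrpos($path, '/') + 1);
    $segments = [];
    foreach (explode('/', $dir . $rel) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $resolved = $origin . '/' . implode('/', $segments);
    if (substr($rel, -1) === '/' && $segments) {
        $resolved .= '/';                 // keep a trailing slash from the href
    }
    return $resolved . $suffix;
}

function collectLinks(string $html, string $pageUrl): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);               // suppress warnings from malformed markup

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $links[] = resolveUrl($pageUrl, $href);
        }
    }
    return array_values(array_unique($links));
}

// Sample markup standing in for the $html fetched with cURL.
$html = '<html><body><a href="test.html">a</a><a href="/about">b</a>'
      . '<a href="../up.html">c</a><a href="http://other.example/x">d</a></body></html>';
$found = collectLinks($html, 'http://www.example.com/dir/page.html');
print_r($found);
```

    In the script above, the same idea would mean passing resolveUrl($target_url, $url) to storeLink instead of the raw $url.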
     
    Brandon_R, Oct 25, 2009 IP
  7. JAY6390

    JAY6390 Peon

    #7
    JAY6390, Oct 25, 2009 IP