Fetching links from a URL (page).

Discussion in 'PHP' started by AHA7, Apr 27, 2007.

  1. #1
    Hello,

    I am still a beginner with PHP and I want some help on this issue, please!

    I want a script that opens an external URL (i.e. an HTML page), extracts all the links from it, and outputs them to the current document. The external page contains relative links, so I want the script to convert them to absolute links, and I also want to give them custom anchor texts. OK, here's an example:

    The external URL (page) is: "http://external.com/linkpage?pageid=7". This page contains two links:
    1. <a href="/photos?ID=5&order=alpha">See more photos here</a>, and
    2. <a href="/about.php">About</a>
    Now I want my script to visit the URL "http://external.com/linkpage?pageid=7", extract the first link from the href of the first <a> (link) tag, convert it to an absolute link by adding "http://external.com" before it (I can supply this, so the script doesn't need to guess it), and then output this link to my page with an anchor text of my choice. And the same for the second link.
    So, I want the final output of the script in this example to be like this:
    1. <a href="http://external.com/photos?ID=5&order=alpha">I set this anchor text</a>, and
    2. <a href="http://external.com/about.php">My anchor text</a>.

    Any help?
     
    AHA7, Apr 27, 2007
  2. nico_swd

    #2
    Here's a part of a class that I wrote some time ago. I modified it for your needs, I think. I didn't test it after the changes though:

    $url should be the link, e.g. /photos?ID=5&order=alpha
    $fullurl should be the base URL that the link is relative to, e.g. http://external.com


    
    	function construct_absolute_url($url, $fullurl)
    	{
    		$url = trim($url);
    		
    		// Already absolute: nothing to do.
    		if (preg_match('/^https?:\/\//i', $url))
    		{
    			return $url;
    		}
    		
    		// Take scheme and host from the base URL, not from the relative link.
    		if (!preg_match('/^https?:\/\/[^\/]+/i', $fullurl, $mainurl))
    		{
    			return $url;
    		}
    		
    		// Root-relative link: keep its whole path.
    		if ($url !== '' AND $url[0] == '/')
    		{
    			return $mainurl[0] . $url;
    		}
    		
    		// Directories of the base URL, without query string and file name.
    		$path = preg_replace('/[?#].*$/', '', (string) substr($fullurl, strlen($mainurl[0])));
    		$dirs = preg_split('/\//', $path, -1, PREG_SPLIT_NO_EMPTY);
    		if ($dirs AND substr($path, -1) != '/')
    		{
    			array_pop($dirs);
    		}
    		
    		// Resolve leading "./" and "../" segments.
    		while (preg_match('/^(\.\.?)\//', $url, $m))
    		{
    			if ($m[1] == '..')
    			{
    				array_pop($dirs);
    			}
    			$url = substr($url, strlen($m[0]));
    		}
    		
    		return $mainurl[0] . '/' . ($dirs ? implode('/', $dirs) . '/' : '') . $url;
    	}
    
     
    nico_swd, Apr 27, 2007
  3. AHA7

    #3
    Thanks for replying, nico_swd! But I don't think your script does what I want to do.
    It does not fetch the links from an external URL (page) (My first post has more details), does it???
     
    AHA7, Apr 27, 2007
  4. lwbbs

    #4
    You can do it based on nico_swd's code.
    But you need to check the HTML base tag: if the base tag isn't set,
    use the input URL as the base URL. Then resolve each link against
    the base URL to get the full URL.

     
    lwbbs, Apr 27, 2007
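
A minimal sketch of the base-tag check described in the post above, assuming the page's HTML is already available as a string. `find_base_url` is a hypothetical helper name, not part of nico_swd's class, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper: returns the URL that relative links on a page
// should be resolved against.
function find_base_url($html, $pageUrl)
{
    $doc = new DOMDocument();
    // @ suppresses the warnings that real-world, malformed HTML tends to trigger.
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            return $href; // the first <base href="..."> wins
        }
    }

    return $pageUrl; // no base tag: fall back to the page's own URL
}

// Made-up sample page with a base tag.
$html = '<html><head><base href="http://external.com/gallery/"></head>'
      . '<body><a href="photos?ID=5">See more photos here</a></body></html>';

echo find_base_url($html, 'http://external.com/linkpage?pageid=7'), "\n";
echo find_base_url('<html><body><p>No base tag</p></body></html>',
                   'http://external.com/linkpage?pageid=7'), "\n";
```

The returned value could then be handed to a resolver such as construct_absolute_url() as its $fullurl argument.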
  5. AHA7

    #5
    OK, I think I've written something. :eek: But will it do what I want it to?

    $page = fopen("http://external.com/linkpage?pageid=7", "r");
    $linkStart = explode("href=\"", $page); //Split the page into strings.
    
    $linkCount = count($linkStart)-1;  //The number of strings indicates the number of href='s (links) found except the first string (normally).
    
    for($i=1; $i<=$linkCount; $i++)
    {
    ${relLink.$i} = explode("\"", $linkStart); //The first string (element) of each relLink array should contain the relative link.
    }
    
    for($i=1; $i<=$linkCount; $i++)
    {
    echo "<a href=\"http://external.com" . ${relLink.$i}[0] . "\">Link #" . $i . "<\/a>";
    }
     
    AHA7, Apr 27, 2007
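
As posted, the script in #5 would probably not work: fopen() returns a file handle rather than the page text, and the inner explode() is given the whole $linkStart array instead of one element. Below is a hedged sketch of the same split-on-href idea with those two points changed. `extract_hrefs` is a hypothetical helper name, and the sample markup stands in for the fetched page so the snippet runs without network access.

```php
<?php
// Hypothetical helper: collects every double-quoted href value found in $html.
function extract_hrefs($html)
{
    $hrefs = array();
    // Every chunk after the first one starts right behind an href=" marker.
    $chunks = explode('href="', $html);
    for ($i = 1; $i < count($chunks); $i++) {
        // The link itself runs up to the next double quote.
        $parts = explode('"', $chunks[$i]);
        $hrefs[] = $parts[0];
    }
    return $hrefs;
}

// For a live page the markup would come from something like
// file_get_contents('http://external.com/linkpage?pageid=7'),
// which returns the page as a string.
$page = '<a href="/photos?ID=5&order=alpha">See more photos here</a>, and '
      . '<a href="/about.php">About</a>';

foreach (extract_hrefs($page) as $n => $relLink) {
    // Escape the URL so that "&" is valid inside the generated attribute.
    $absolute = htmlspecialchars('http://external.com' . $relLink);
    echo '<a href="' . $absolute . '">Link #' . ($n + 1) . "</a>\n";
}
```

This only catches links written exactly as href="..." with double quotes; single-quoted or unquoted attributes would be missed, which is one reason a real HTML parser is usually the safer choice.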
  6. AHA7

    #6
    Can someone please tell me what's wrong with the following code:

    for($i=1; $i<=5; $i++)
    {
    ${string2.$i} = explode("\"", $string1);
    }

    for($i=1; $i<=5; $i++)
    {
    echo ${relLink.$i}[0];
    }

    I guess the problem is with those variables highlighted in red. What did I do wrong?
     
    AHA7, Apr 27, 2007
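
Two likely culprits in the snippet from #6: the first loop fills ${string2.$i} while the second reads ${relLink.$i}, and the bare words string2 and relLink are read by PHP as constants, not strings (a notice in older versions, an Error in PHP 8). A small sketch of two ways around this follows; the contents of $string1 are an assumption, since the post does not show them.

```php
<?php
// Assumed input: the chunks left after splitting a page on href="
$string1 = array(
    '',
    '/photos?ID=5">See more photos here</a>',
    '/about.php">About</a>',
);

// An ordinary array holds one entry per link and avoids numbered variable names.
$relLink = array();
for ($i = 1; $i < count($string1); $i++) {
    $pieces = explode('"', $string1[$i]); // split one chunk, not the whole array
    $relLink[$i] = $pieces[0];
}

for ($i = 1; $i < count($string1); $i++) {
    echo $relLink[$i], "\n";
}

// If numbered variables are really wanted, the name has to be built as a string:
${'relLink' . 1} = $relLink[1];
echo $relLink1, "\n";
```

The array form is usually easier to loop over and to pass around than a family of variables named relLink1, relLink2, and so on.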
  7. decepti0n

    #7
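    If you have PHP 5, you can use the DOMDocument class:
    $url = 'http://www.google.com';
    echo '<h1>URL: '.htmlspecialchars($url).'</h1>';
    $n = new DOMDocument();
    @$n->loadHTMLFile($url); // @ hides warnings about malformed HTML
    $construct = ''; // initialise before appending to it
    foreach ($n->getElementsByTagName('a') as $o) {
        // Simply prefixes $url, so this assumes every href is root-relative (starts with "/")
        $construct .= '<li><a href="'.htmlspecialchars($url.$o->getAttribute('href')).'">meh anchor</a></li>';
    }
    echo '<ol>' . $construct . '</ol>'; // Final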
    You can combine it with a form to add your own anchor text
     
    decepti0n, Apr 27, 2007
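
A rough sketch of how the DOMDocument approach could be combined with custom anchor texts, as asked for in the first post. `relabel_links` is a hypothetical helper, the anchor texts are matched to the links by position, and only root-relative links (those starting with "/") get the site prefix; all of these are assumptions made for illustration. The markup is passed in as a string so the example runs offline.

```php
<?php
// Hypothetical helper: rewrites each link in $html as an absolute link
// with a caller-supplied anchor text.
function relabel_links($html, $site, array $anchors)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ hides warnings about sloppy markup

    $items = '';
    $i = 0;
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Assumption: only root-relative links need the site prefix.
        if ($href !== '' && $href[0] === '/') {
            $href = $site . $href;
        }
        // Fall back to the URL itself when no anchor text was supplied.
        $text = isset($anchors[$i]) ? $anchors[$i] : $href;
        $items .= '<li><a href="' . htmlspecialchars($href) . '">'
                . htmlspecialchars($text) . "</a></li>\n";
        $i++;
    }

    return "<ol>\n" . $items . "</ol>\n";
}

// Sample markup modelled on the example in the first post.
$html = '<a href="/photos?ID=5&amp;order=alpha">See more photos here</a>, and '
      . '<a href="/about.php">About</a>';

echo relabel_links($html, 'http://external.com',
                   array('I set this anchor text', 'My anchor text'));
```

For a live page, the $html argument could be filled from file_get_contents() on the external URL; the anchor texts could equally come from a form, as suggested above.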