How to extract the internal links of a site?

Discussion in 'PHP' started by Freewebspace, Jun 16, 2007.

  1. #1
    As the title says:

    How can I extract the internal links of a website from a particular page?

    Only internal links, no external links.
     
    Freewebspace, Jun 16, 2007
  2. nico_swd

    nico_swd Prominent Member

    Messages:
    4,153
    Likes Received:
    344
    Best Answers:
    18
    Trophy Points:
    375
    #2
    
    function fetch_links($url)
    {
    	if (!preg_match('/^https?:\/\/(\w+\.)?([^\/]+)/i', $url, $host))
    	{
    		trigger_error('Invalid URL given.');
    		return false;
    	}
    
    	if (preg_match_all('/<a.+href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i', @file_get_contents($url), $links))
    	{
    		foreach (array_unique($links[1]) AS $index => $link)
    		{
    			$link = trim($link);
    
    			if (preg_match('/^(ht|f)tps?:\/\//i', $link) AND !preg_match('/^(ht|f)tps?:\/\/(\w+\.)?' . preg_quote($host[2], '/') .'/i', $link) OR $link[0] == '#')
    			{
    				unset($links[1][$index]);
    			}
    		}
    		
    		return $links[1];
    	}
    	
    	return false;
    }
    
    PHP:
    Usage example:
    
    echo '<pre>' . print_r(fetch_links('http://forums.digitalpoint.com/showthread.php?t=367847'), true) . '</pre>';
    
    PHP:
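    To see what the first preg_match() in fetch_links() captures, here is a small sketch that runs the same host pattern over a few made-up URLs (the example.com addresses are assumptions for illustration, not from the thread):

```php
<?php
// Same host pattern as in fetch_links(); the sample URLs are made up for illustration.
$pattern = '/^https?:\/\/(\w+\.)?([^\/]+)/i';

$hosts = array();
foreach (array('http://www.example.com/page', 'http://forums.example.com/', 'http://example.com/') as $url) {
    // $host[1] is the optional leading label (e.g. "www."), $host[2] is the rest of the host.
    preg_match($pattern, $url, $host);
    $hosts[$url] = $host[2];
}

print_r($hosts);
```

    The first two URLs both give "example.com", so links on any subdomain count as internal. The bare "http://example.com/" gives only "com", and a link such as http://other.com/ would then pass the host check, so it is probably safer to pass the URL with its "www." prefix.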
     
    nico_swd, Jun 16, 2007
  3. Freewebspace

    Freewebspace Notable Member

    Messages:
    6,213
    Likes Received:
    370
    Best Answers:
    0
    Trophy Points:
    275
    #3
    I replaced the URL with my site, http://www.jeffbrowninc.com

    Result



    It does not seem to show all the internal links. Also, how do I handle a URL that contains a PHP session ID?
     
    Freewebspace, Jun 16, 2007
  4. nico_swd

    #4
    Just a slight error in the pattern. This works for me.
    
    function fetch_links($url)
    {
        // $host[2] is the host, minus a leading label such as "www." when one matches.
        if (!preg_match('/^https?:\/\/(\w+\.)?([^\/]+)/i', $url, $host))
        {
            trigger_error('Invalid URL given.');
            return false;
        }
    
        // The lazy ".*?" stops at the first href, so several links on one line are all found.
        if (preg_match_all('/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i', @file_get_contents($url), $links))
        {
            foreach ($links[1] AS $index => $link)
            {
                $link = trim($link);
    
                // Drop absolute links to other hosts, and same-page "#" anchors.
                if ((preg_match('/^(ht|f)tps?:\/\//i', $link) AND !preg_match('/^(ht|f)tps?:\/\/(\w+\.)?' . preg_quote($host[2], '/') .'/i', $link)) OR strpos($link, '#') === 0)
                {
                    unset($links[1][$index]);
                }
            }
           
            return array_unique($links[1]);
        }
       
        return false;
    }
    
    
    PHP:
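    For anyone wondering what the pattern change does, here is a minimal sketch that runs both patterns over a made-up HTML string with two links on one line (the sample markup is an assumption for illustration):

```php
<?php
// Hypothetical sample markup: two links on the same line.
$html = '<a href="/one">One</a> <a href="/two">Two</a>';

// Pattern from post #2: ".+" is greedy, so it runs on to the last "href" on the line.
$greedy = '/<a.+href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i';

// Pattern from post #4: ".*?" is lazy, so it stops at the first "href" it reaches.
$lazy = '/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/i';

preg_match_all($greedy, $html, $greedy_links);
preg_match_all($lazy, $html, $lazy_links);

print_r($greedy_links[1]); // only "/two"
print_r($lazy_links[1]);   // "/one" and "/two"
```

    Because "." does not match a newline by default, the greedy version only loses links that share a line with a later link, which would explain why some internal links went missing rather than all of them.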
    And what do you mean by the session ID? Do you want to remove it from the string?

    If so, try replacing this:
    
    $link = trim($link);
    
    PHP:
    With:
    
    $links[1][$index] = trim(preg_replace('/(&|\?)(s|PHPSESSID)=[a-f0-9]{32}/', '', html_entity_decode($link)));
    
    PHP:
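    One thing to watch with the replacement above: if the session parameter comes first in the query string and other parameters follow, removing it together with its "?" leaves something like "index.php&page=2". Below is a rough sketch of a variant that keeps the separator in that case. The helper name strip_session_id() and the sample URLs are made up for illustration, and it assumes the same 32-character hex IDs as the pattern above.

```php
<?php
// Hypothetical helper (not from the thread): remove an "s" or "PHPSESSID"
// query parameter that holds a 32-character lowercase hex value.
function strip_session_id($link)
{
    $link = html_entity_decode($link); // turn "&amp;" back into "&"

    // Session parameter followed by more parameters: keep the "?" or "&" separator.
    $link = preg_replace('/([?&])(?:s|PHPSESSID)=[a-f0-9]{32}&/', '$1', $link);

    // Session parameter at the end of the query string: drop it with its separator.
    $link = preg_replace('/[?&](?:s|PHPSESSID)=[a-f0-9]{32}(?=#|$)/', '', $link);

    return trim($link);
}

$sid = '0123456789abcdef0123456789abcdef'; // made-up 32-character ID

echo strip_session_id('showthread.php?t=1&amp;s=' . $sid), "\n";
echo strip_session_id('index.php?PHPSESSID=' . $sid . '&amp;page=2'), "\n";
```

    With these sample inputs the two lines printed are "showthread.php?t=1" and "index.php?page=2".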
     
    nico_swd, Jun 16, 2007
  5. Freewebspace

    #5

    Yes, it works fine!

    No problems with PHP session IDs!
     
    Freewebspace, Jun 16, 2007