Clean and fix URLs

Discussion in 'PHP' started by stephan2307, Nov 17, 2010.

  1. #1
    I have a small crawler script which works really nicely. My only problem is with URLs.

    My crawler will extract all links from a page. Next I need to clean the list up. I am removing any that start with mailto:, skype:, javascript: and #
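
    The filtering step described above could look roughly like this. It is only a minimal sketch: the helper name filterLinks and the sample links are made up for illustration and are not the actual crawler code.

```php
<?php
// Hypothetical helper: drop links that can never be crawled as pages.
function filterLinks(array $links)
{
    $kept = array();
    foreach ($links as $link) {
        $link = trim($link);
        // Skip empty links, fragments and the unwanted schemes.
        if ($link === '' || preg_match('/^(#|mailto:|skype:|javascript:)/i', $link)) {
            continue;
        }
        $kept[] = $link;
    }
    return $kept;
}

// Sample data (assumed for the example).
$links = array('mailto:me@domain.com', '#top', 'javascript:void(0)', 'index.php', '/index.php');
print_r(filterLinks($links)); // keeps only 'index.php' and '/index.php'
```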

    Now I am left with URLs that look like this:

    http://www.domain.com
    http://www.domain.com/
    http://www.domain.com/index.php
    http://www.domain.com/subdir/index.php
    ../index.php
    ../../index.php
    index.php
    /index.php


    How can I clean them up so that they all start with http:// and don't break?

    Any help?
     
    stephan2307, Nov 17, 2010
  2. KingOle

    #2
    Prefix them with http:// and the server name?

    
    $var = "http://" . $_SERVER['SERVER_NAME'] . $yourlink;
    
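    One caveat: $_SERVER['SERVER_NAME'] is the host the script itself runs on, so the line above only helps when the crawler and the crawled site share a host, and it is usually not set at all when the script runs from the command line. A hedged variant that takes the host from the crawled page instead is sketched below. $page_url and the value of $yourlink are assumed example values, and it only handles links that start with a slash.

```php
<?php
// Assumed example values: the page the link was found on and a root-relative link.
$page_url = "http://www.domain.com/subdir/index.php";
$yourlink = "/index.php";

// Take scheme and host from the crawled page rather than from $_SERVER.
$parts = parse_url($page_url);
$var = $parts['scheme'] . "://" . $parts['host'] . $yourlink;

echo $var; // http://www.domain.com/index.php
```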
     
    KingOle, Nov 17, 2010
  3. samyak

    #3
    This is clumsy but working code that cleans the URLs the way you want.

    
    <?php
    // Resolve each raw link against the URL of the page it was found on.
    // Note: the last path segment of $crawled_url is treated as a directory.
    function cleanURLs($crawled_url, $raw_links)
    {
    	$crawled_url_details = parse_url($crawled_url);
    	// 'path' is absent for URLs like http://www.domain.com, so default to "".
    	$path = isset($crawled_url_details['path']) ? trim($crawled_url_details['path'], "/") : "";
    	$crawled_url_paths = ($path === "") ? array() : explode("/", $path);
    	$base = $crawled_url_details['scheme']."://".$crawled_url_details['host'];
    	$clean_link = array();
    	$path_depth = count($crawled_url_paths);
    	foreach($raw_links as $url)
    	{
    		if(preg_match("/^https?:\/\//i", $url))
    			$clean_link[] = $url;        // already absolute: keep as is
    		elseif(preg_match("/^\//", $url))
    			$clean_link[] = $base.$url;  // root-relative: prepend scheme and host
    		else
    		{
    			$url_arr = explode("/", $url);
    			$real_path_depth = $path_depth;
    			$required_url_path = array();
    			foreach($url_arr as $url_part)
    			{
    				if($url_part == "..")
    					$real_path_depth--;  // go up one directory
    				elseif($url_part != ".")
    					$required_url_path[] = $url_part;
    			}
    			// Rebuild: host, remaining directories of the crawled page, then the link.
    			$new_url_array = array($base);
    			for($i = 0; $i < $real_path_depth; $i++)
    				$new_url_array[] = $crawled_url_paths[$i];
    			$new_url_array[] = implode("/", $required_url_path);
    			$clean_link[] = implode("/", $new_url_array);
    		}
    	}
    	return $clean_link;
    }
    ?>
    Usage:
    <?php
    
    $crawled_url = "http://www.domain.com/sub1/sub2";
    $raw_links = array();
    $raw_links[]="http://www.domain.com";
    $raw_links[]="http://www.domain.com/";
    $raw_links[]="http://www.domain.com/index.php";
    $raw_links[]="http://www.domain.com/subdir/index.php";
    $raw_links[]="./index.php";
    $raw_links[]="../index.php";
    $raw_links[]="../../index.php";
    $raw_links[]="index.php";
    $raw_links[]="/index.php";
    
    $clean_links = cleanURLs($crawled_url, $raw_links);
    echo "<table>";
    for($i=0; $i<count($raw_links); $i++)
    {
    	echo "<tr><td>".$raw_links[$i]." </td><td> ".$clean_links[$i]."</td></tr>";
    }
    echo "</table>";
    ?>
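    A different approach from the function above, in case the crawled URL points at a file rather than a directory: the sketch below follows the usual browser rule, where everything after the last slash of the base URL is dropped before the relative link is appended. The helper name resolveUrl is made up for this example, and the code is simplified: it ignores ports, query strings and fragments in the base URL and does not preserve a trailing slash.

```php
<?php
// Hypothetical helper: resolve $rel against $base roughly the way a browser would.
function resolveUrl($base, $rel)
{
    // Already absolute: nothing to do.
    if (preg_match('/^https?:\/\//i', $rel)) {
        return $rel;
    }
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host'];
    $path = isset($parts['path']) ? $parts['path'] : '/';

    if (substr($rel, 0, 1) === '/') {
        // Root-relative: ignore the base path entirely.
        $path = $rel;
    } else {
        // Drop everything after the last slash of the base path, then append.
        $path = substr($path, 0, strrpos($path, '/') + 1) . $rel;
    }

    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    return $root . '/' . implode('/', $out);
}

echo resolveUrl("http://www.domain.com/sub1/sub2/", "../index.php"), "\n";  // http://www.domain.com/sub1/index.php
echo resolveUrl("http://www.domain.com/sub1/page.php", "index.php"), "\n";  // http://www.domain.com/sub1/index.php
```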
     
    samyak, Nov 18, 2010
  4. stephan2307

    #4
    Wow, thank you. Exactly what I needed.
     
    stephan2307, Nov 18, 2010
  5. samyak

    #5
    You are welcome :)
     
    samyak, Nov 18, 2010
  6. SterlingS

    #6
    @samyak nice script!
     
    SterlingS, Nov 21, 2010