I have a small crawler script which works really well. My only problem is with URLs. My crawler extracts all links from a page. Next I need to clean the list up. I remove any links that start with mailto:, skype:, javascript: or #. Now I am left with URLs that look like this:

http://www.domain.com
http://www.domain.com/
http://www.domain.com/index.php
http://www.domain.com/subdir/index.php
../index.php
../../index.php
index.php
/index.php

How can I clean them up so that they all start with http:// and don't break? Any help?
This is clumsy but working code that cleans the URLs the way you want.

<?php
function cleanURLs($crawled_url, $raw_links)
{
    $crawled_url_details = parse_url($crawled_url);
    // The path of the crawled URL is treated as a directory.
    // It may be missing entirely (e.g. "http://www.domain.com").
    $crawled_path = isset($crawled_url_details['path']) ? trim($crawled_url_details['path'], "/") : "";
    $crawled_url_paths = ($crawled_path === "") ? array() : explode("/", $crawled_path);
    $base = $crawled_url_details['scheme']."://".$crawled_url_details['host'];
    $clean_link = array();
    $path_depth = count($crawled_url_paths);
    foreach ($raw_links as $url) {
        if (preg_match("/^https?:\/\//i", $url)) {
            // Already absolute: keep as is.
            $clean_link[] = $url;
        } elseif (preg_match("/^\//", $url)) {
            // Root-relative: prepend scheme and host.
            $clean_link[] = $base.$url;
        } else {
            // Relative: go up one directory for every "..", skip ".".
            $url_arr = explode("/", $url);
            $real_path_depth = $path_depth;
            $required_url_path = array();
            foreach ($url_arr as $url_part) {
                if ($url_part == "..") {
                    // Never climb above the document root.
                    $real_path_depth = max(0, $real_path_depth - 1);
                } elseif ($url_part != ".") {
                    $required_url_path[] = $url_part;
                }
            }
            $file_name = implode("/", $required_url_path);
            $new_url_array = array();
            $new_url_array[] = $base;
            for ($i = 0; $i < $real_path_depth; $i++) {
                $new_url_array[] = $crawled_url_paths[$i];
            }
            $new_url_array[] = $file_name;
            $clean_link[] = implode("/", $new_url_array);
        }
    }
    return $clean_link;
}
?>

Usage:

<?php
$crawled_url = "http://www.domain.com/sub1/sub2";
$raw_links = array();
$raw_links[] = "http://www.domain.com";
$raw_links[] = "http://www.domain.com/";
$raw_links[] = "http://www.domain.com/index.php";
$raw_links[] = "http://www.domain.com/subdir/index.php";
$raw_links[] = "./index.php";
$raw_links[] = "../index.php";
$raw_links[] = "../../index.php";
$raw_links[] = "index.php";
$raw_links[] = "/index.php";

$clean_links = cleanURLs($crawled_url, $raw_links);

// Print each raw link next to its cleaned version.
echo "<table>";
for ($i = 0; $i < count($raw_links); $i++) {
    echo "<tr><td>".$raw_links[$i]." </td><td> ".$clean_links[$i]."</td></tr>";
}
echo "</table>";
?>
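One thing to watch for: the function above treats the whole path of the crawled URL as a directory, so a crawled page such as http://www.domain.com/subdir/index.php would not resolve "index.php" the way a browser does. Below is a rough sketch of an alternative that resolves links relative to the last "/" of the base path, loosely following the reference-resolution rules of RFC 3986. The name resolveUrl is hypothetical and not part of the code above. It assumes PHP 7 or later and an http(s) base URL with a host, and it simplifies the RFC (for example, empty path segments are collapsed and userinfo is ignored).

```php
<?php
// Hypothetical helper (not from the answer above): resolve $link against $base.
// Simplified take on RFC 3986 reference resolution; assumes $base has scheme and host.
function resolveUrl(string $base, string $link): string
{
    // Link already has a scheme (http:, https:, ...): return it unchanged.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $link)) {
        return $link;
    }

    $b = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative link (//host/path): borrow only the scheme of the base.
    if (substr($link, 0, 2) === '//') {
        return $b['scheme'] . ':' . $link;
    }

    $basePath = $b['path'] ?? '/';

    // Split off ?query and #fragment so they are not treated as path segments.
    $suffix = '';
    if (preg_match('/[?#]/', $link, $m, PREG_OFFSET_CAPTURE)) {
        $suffix = substr($link, $m[0][1]);
        $link = substr($link, 0, $m[0][1]);
    }

    if ($link === '') {
        // Only a query or fragment was given: keep the base path.
        $path = $basePath;
    } elseif ($link[0] === '/') {
        // Root-relative link.
        $path = $link;
    } else {
        // Relative link: replace everything after the last "/" of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $link;
    }

    // Remove "." and ".." segments; ".." never climbs above the root.
    $segments = explode('/', $path);
    $last = end($segments);
    $keepSlash = ($last === '' || $last === '.' || $last === '..');
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    $clean = '/' . implode('/', $out);
    if ($keepSlash && $out) {
        $clean .= '/';
    }

    return $root . $clean . $suffix;
}

// Usage: resolve a few links found on one crawled page.
$base = 'http://www.domain.com/subdir/page.php';
foreach (['index.php', './index.php', '../index.php', '/index.php'] as $link) {
    echo $link, ' => ', resolveUrl($base, $link), PHP_EOL;
}
```

With that base, "index.php" resolves into /subdir/ while "../index.php" resolves to http://www.domain.com/index.php. Links with mailto:, skype: or javascript: would pass through unchanged here, so this sketch assumes they were filtered out beforehand, as described in the question.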