I have developed a broken link checker that works great, unless the URLs don't have the base href in them. For example, if the links are

    ...href="http://www.somesite.com/somepage.html"...

it works great. But if they are ...href="somepage.html"..., ...href="/somepage.html"..., or ...href="./somepage.html"..., it ignores them.

Here's the problem code:

    $matches = array();
    preg_match_all("|href\=\"?'?`?([[:alnum:]:?=&@/;._-]+)\"?'?`?|i", $html, $matches);
    $links = array();
    $ret = $matches[1];
    for($i=0;isset($ret[$i]);$i++) {
        if(preg_match("|^http://(.*)|i", $ret[$i])) {
            $links[] = $ret[$i];
        } elseif(preg_match("|^(.*)|i", $ret[$i])) {
            $links[] = "http://".$info["host"]."". $ret[$i];
        }
    }
    return $links;
    }

I thought

    } elseif(preg_match("|^(.*)|i", $ret[$i])) {
        $links[] = "http://".$info["host"]."". $ret[$i];

would have taken care of it. Please help! Many thanks.
The problem here is that it's hard to debug a regex just by looking at it. Can I suggest you try using standard functions wherever possible? For example, to check whether there is an http:// in the URL you could use strpos. Here is a simple function which I haven't tested, but which would probably cover most URLs. URLs that start with ./ are probably within a subfolder below the root, though, so how can you work out their full URL?

    function format_url($url, $domain_url){
        /* URL already contains the domain URL: return it unchanged */
        if( strpos($url, $domain_url) !== false ) {
            return $url;
        }
        /* Remove first slash from url if present */
        if( substr($url, 0, 1) == '/' ) {
            $url = substr($url, 1);
        }
        /* Prepend domain URL to URL (assumes $domain_url ends with a slash) */
        return $domain_url . $url;
    }

Not sure if this helps.
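On the ./ question: one possible approach is to resolve each link against the full URL of the page it was found on, rather than against the domain alone, so that document-relative links pick up the page's folder. Below is a minimal sketch of that idea, not a complete resolver. The names resolve_url and $page_url are hypothetical (they are not from the code above), and it assumes $page_url is a well-formed absolute URL that parse_url can split.

```php
<?php
// Hypothetical helper: turn $href into an absolute URL, using the URL of the
// page the link was found on ($page_url) as the base.
function resolve_url($href, $page_url)
{
    // Already absolute (http:, https:, mailto:, ...): return unchanged.
    if (preg_match('|^[a-z][a-z0-9+.-]*:|i', $href)) {
        return $href;
    }

    $base = parse_url($page_url);
    $root = $base['scheme'] . '://' . $base['host']
          . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative link (//host/path): reuse the base scheme.
    if (substr($href, 0, 2) === '//') {
        return $base['scheme'] . ':' . $href;
    }

    // Set any ?query or #fragment aside so only the path is normalised.
    $cut    = strcspn($href, '?#');
    $suffix = (string) substr($href, $cut);
    $path   = substr($href, 0, $cut);

    // Document-relative link: prefix the directory of the current page.
    // Root-relative links (starting with /) are used as they are.
    if (substr($path, 0, 1) !== '/') {
        $dir  = isset($base['path']) ? $base['path'] : '/';
        $dir  = substr($dir, 0, strrpos($dir, '/') + 1);
        $path = $dir . $path;
    }

    // Collapse "." and ".." segments without climbing above the root.
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '.') {
            continue;
        }
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
            continue;
        }
        $out[] = $segment;
    }

    return $root . implode('/', $out) . $suffix;
}
```

With a page at http://www.somesite.com/dir/index.html, both somepage.html and ./somepage.html resolve to http://www.somesite.com/dir/somepage.html, while /somepage.html resolves to http://www.somesite.com/somepage.html. The sketch does not cover links that consist only of a query string or fragment (such as #top), nor a base href tag in the page, so treat it as a starting point.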