Here is the thing. I have a text file with links from one domain, separated by newlines, like this:

http://domain.com/link1.html
http://domain.com/link2.html
http://domain.com/link3.html

Now I need to load all the links from the file and, for every link, extract all the links that are on that page. So I need to take a link from the file -> get the content of the link -> preg_match_all all the links on that page (internal and external) -> store them somewhere for later use -> go to the next link in the text file -> get its content -> preg_match_all all the links on that page (internal and external)... and when the script has finished checking all the links from the file, it needs to write all the matched links to another file. Uff.

I hope you understand what I want. I'm new to PHP, so I'm a little confused. My question is: what functions should I use? Do I create an array from the text file and put it in a loop somehow? I'm really desperate; I've been working on this for two days and can't even find where to start, so some pointers would be nice, and a whole script would be like winning the lottery (you don't even need to try, I know you're busy). Thanks in advance for any help.
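For what it's worth, the steps described above (read the link list, fetch each page, pull out the hrefs, write everything to a second file) can be sketched roughly like this. The filenames `linklist.txt` and `output.txt` and the helper name `extract_links` are just placeholders, not from anyone's actual script:

```php
<?php
// Pull every href="..." value out of a chunk of HTML.
function extract_links($html) {
    preg_match_all('/href\s*=\s*"([^"]+)"/i', $html, $m);
    return $m[1];
}

// One link per line in the input file (hypothetical filename).
$links = @file('linklist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: array();

$found = array();
foreach ($links as $link) {
    $page = @file_get_contents($link);   // fetch one page from the list
    if ($page === false) continue;       // skip pages that fail to load
    $found = array_merge($found, extract_links($page));
}

// Write the unique results, one URL per line (hypothetical filename).
file_put_contents('output.txt', implode("\n", array_unique($found)) . "\n");
```

This is only a minimal sketch; it assumes `allow_url_fopen` is enabled and does no relative-URL handling, which the replies below get into.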
To put it bluntly, can you pay? I've already written a script for this, but never used it, or even sold it (yet). Dan
```php
<?php
//open file and put links in an array, one per line
$filename = "linklist.txt";
$content = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

//open up your output file once, before the loop,
//so each page's links get appended instead of overwritten
$fh = fopen('whateverfile.txt', 'w') or die("can't open file");

//run through each link
foreach ($content as $link) {
    //open the link and put the page into $file
    $file = implode('', file($link));

    //run my little function, see below for it, very, very dirty
    $all_links = dig_all('a href="', '"', $file);

    //now all the links from the page are in the array $all_links;
    //write each one to the file with a line break
    //(double quotes, so "\n" is a real newline and not a literal \n)
    foreach ($all_links as $found) {
        fwrite($fh, $found . "\n");
    }
}
fclose($fh);

//a little function i use often to pull from a page all instances
//of whatever sits between two strings - a dirty finder, no regex
function dig_all($start_str, $end_str, $page, $i = 0, $limit = 0) {
    $result = array();
    $data = explode($start_str, $page); //split once, outside the loop
    $more = true;
    do {
        $i++;
        if (!isset($data[$i])) break;   //ran out of matches
        $data1 = explode($end_str, $data[$i]);
        $data2 = $data1[0];
        if (!$data2 || ($limit > 0 && $i == $limit)) $more = false;
        else $result[] = $data2;
    } while ($more == true);
    return $result;
}
```

That's it. I'm not good with regular expressions, so I have my own function that finds what I need. You may want to research those a little bit and clean this up. There are better ways to recognize URLs on a page, much better ways. Enjoy. Go spider now.
Sorry about that, I had the code lying around and just added in the notes. Except mine grabbed full sentences and altered words/punctuation... You can guess what that's for. lol
Meh, doesn't bother me. Here's mine:

```php
<?php
/**
 * Danltn | http://danltn.com/
 * No warranty is given to code used
 */
function get_all_urls($url = '', $curl = false)
{
    if (!$url) {
        /* If no URL provided, throw a warning */
        trigger_error('You must provide a URL', E_USER_WARNING);
        return array();
    }

    if ($curl and function_exists('curl_setopt') and function_exists('curl_init') and function_exists('curl_exec')) {
        /* If we have cURL set to true AND it all checks out */
        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_TIMEOUT, 60);
        curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
        curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
        /* Appear as Google */
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        $page = curl_exec($curl);
        curl_close($curl);
    } else {
        $page = @file_get_contents($url);
    }

    $preg = array();
    $base = array();
    $parsed = parse_url($url);

    preg_match_all("/\<a(\s*)href(\s*)=(\s*)\"(.*?)\"(.*?)\>(.*?)\<\/a\>/i", $page, $preg[0]);
    preg_match_all("/\<a(\s*)href(\s*)=(\s*)'(.*?)'(.*?)\>(.*?)\<\/a\>/i", $page, $preg[1]);
    preg_match("/\<base(\s*)href(\s*)=(\s*)\"(.*?)\"(\s*)\/\>/i", $page, $base);

    $href = array_merge($preg[0][4], $preg[1][4]);

    $base = (!empty($base[4]))
        ? $base[4]
        : ((!empty($parsed['user']))
            ? "{$parsed['scheme']}://{$parsed['user']}:{$parsed['pass']}@{$parsed['host']}"
            : "{$parsed['scheme']}://{$parsed['host']}");

    for ($i = 0, $counthref = count($href); $i < $counthref; $i++) {
        if (substr($href[$i], 0, 1) == '/') $href[$i] = "{$base}{$href[$i]}";
        if (substr($href[$i], 0, 1) == '?' || substr($href[$i], 0, 1) == '#') $href[$i] = "{$url}{$href[$i]}";
        if (substr($href[$i], 0, 7) != "http://") $href[$i] = "{$base}/{$href[$i]}";
        while (strstr($href[$i], "//")) $href[$i] = str_replace("//", "/", $href[$i]);
        $href[$i] = str_replace("http:/", "http://", $href[$i]);
    }

    return array_unique($href);
}

print_r(get_all_urls('http://danltn.com'));
?>
```

It probably works better than yours (no offence meant) because it will fix relative URLs (e.g. /index.php) to the full URL (http://danltn.com/index.php) automagically. Dan
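The relative-URL fix-up Dan describes boils down to a few prefix checks. A stripped-down sketch of that idea, separated out into its own function (`resolve_href` is a hypothetical name, not part of Dan's script, and it only handles `http://` like his does):

```php
<?php
// Minimal relative-to-absolute resolver, mirroring the prefix
// checks in get_all_urls() above. $base is scheme://host,
// $page is the URL of the page the link was found on.
function resolve_href($href, $base, $page) {
    if (substr($href, 0, 7) === 'http://') return $href;    // already absolute
    if (substr($href, 0, 1) === '/') return $base . $href;  // root-relative
    if (substr($href, 0, 1) === '?' || substr($href, 0, 1) === '#')
        return $page . $href;                               // query / fragment only
    return $base . '/' . $href;                             // plain relative
}

echo resolve_href('/index.php', 'http://danltn.com', 'http://danltn.com/');
// http://danltn.com/index.php
```

A real crawler would also want `parse_url()` plus proper path merging (handling `../` segments and other schemes), but the prefix checks cover the common cases.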
Maybe this will help with the extraction portion: http://www.web-max.ca/PHP/misc_23.php Here's a resource that appears to cover the topic better: http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
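For reference, a DOM-based extractor in the spirit of what those articles cover might look like this. It's a sketch of my own, not code taken from either page, and it tends to cope with messy real-world HTML better than a regex:

```php
<?php
// Extract all href values from <a> tags using the DOM extension
// (bundled with most PHP builds). More tolerant of sloppy markup
// than a regular expression.
function dom_extract_links($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // @ suppresses warnings on invalid HTML
    $out = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $out[] = $a->getAttribute('href');
        }
    }
    return $out;
}

print_r(dom_extract_links('<a href="/a.html">A</a> <a href="/b.html">B</a>'));
```

The returned hrefs are still raw (relative or absolute, exactly as written in the markup), so you'd combine this with relative-URL resolution like Dan's.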
It's cool. Yours probably works better, though. You use cURL, which is much more effective than my implode trick, plus you use preg_match_all. I like yours much better too. I was just trying to help hastily. lol
Thank you very much, especially Danltn! Btw mbreezy, are you a member of BHW? Your name sounds familiar.