get links from pages (too complicated for me)

Discussion in 'PHP' started by xchris, Jul 12, 2008.

  1. #1
    Here is the thing. I have a text file with links from one domain, separated by newlines

    like this

    http://domain.com/link1.html
    http://domain.com/link2.html
    http://domain.com/link3.html

    Now I need to load all the links from the file and, for every link, extract all the links that are on that page

    Explanation, take 2:
    I need to take a link from the file -> get the content of that page -> preg_match_all all the links on it (internal and external) -> store them somewhere for later use -> go to the next link in the text file and do the same -> and so on. When the script has finished checking all the links from the file, it should write all the links it matched into another file.
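    In outline, the pipeline described above comes down to something like this (a rough sketch only; the filenames are placeholders):

```php
<?php
// Pull every href="..." out of one page's HTML.
function extract_hrefs($html) {
    preg_match_all('/href="([^"]+)"/i', $html, $m);
    return $m[1];
}

$in  = 'linklist.txt';     // one URL per line (placeholder name)
$out = 'found-links.txt';  // all matched links end up here (placeholder name)

if (is_readable($in)) {
    // file() turns the text file into an array, one line per entry.
    $links = file($in, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $found = array();
    foreach ($links as $link) {
        // Fetch the page behind each link; skip pages that fail to load.
        $page = @file_get_contents(trim($link));
        if ($page !== false) {
            $found = array_merge($found, extract_hrefs($page));
        }
    }
    // Write everything that was matched into the output file.
    file_put_contents($out, implode("\n", array_unique($found)) . "\n");
}
```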


    Uff. I hope you understand what I want.

    So now I'm a little confused, since I'm new to PHP. My question is: what functions do I use? Do I create an array from the text file and put it in a loop somehow? :confused::confused::confused: I'm really desperate. I've been working on this for two days and can't even find where to start, so some pointers would be nice, and a whole script would be like winning the lottery (you don't even need to try, I know you're busy)

    Thanks in advance for any help
     
    xchris, Jul 12, 2008 IP
  2. Danltn

    Danltn Well-Known Member

    Messages:
    679
    Likes Received:
    36
    Best Answers:
    0
    Trophy Points:
    120
    #2
    To put it bluntly, can you pay?

    I've already written a script for this, but never used it, or even sold it (yet).

    Dan
     
    Danltn, Jul 12, 2008 IP
  3. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #3
    
    //open file and put links in an array
    $filename = "linklist.txt";
    $content = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    
    //open up your output file once, before the loop
    //(opening it inside the loop with 'w' would wipe it on every pass)
    $fh = fopen('whateverfile.txt', 'w') or die("can't open file");
    
    //run through each link in the list
    foreach ($content as $link)
    {
    //open the link and put the page into $file
    	$file = implode('', file(trim($link)));
    //run my little function, see below for it, very, very dirty
    	$all_links = dig_all('a href="', '"', $file);
    //now all the links from the page are in the array $all_links
    
    //write each one to the file with a line break
    //(note the double quotes - '\n' in single quotes is a literal backslash-n)
    	foreach ($all_links as $found_link)
    	{
    		fwrite($fh, $found_link . "\n");
    	}
    }
    fclose($fh);
    
    
    //a little function i use often to pull from a page all instances
    //of text between two markers - it's a dirty substring finder
    function dig_all ($start_str, $end_str, $page, $limit = 0)
    {
    	$result = array();
    	$pieces = explode($start_str, $page);
    	$count = count($pieces);
    	for ($i = 1; $i < $count; $i++)
    	{
    		if ($limit > 0 && $i > $limit)
    			break;
    		$chunk = explode($end_str, $pieces[$i]);
    		if ($chunk[0] !== '')
    			$result[] = $chunk[0];
    	}
    	return $result;
    }
    
    
    Code (markup):
    That's it. I'm not good with regular expressions, so I have my own function that finds what I need. You may want to research them a bit and clean this up. There are better ways to recognize URLs on a page, much better ways.
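    For the record, one of those better ways is PHP's DOM extension, which actually parses the HTML instead of chopping strings, so it also catches single-quoted and unquoted href attributes that the marker-based approach above misses. A rough sketch:

```php
<?php
// Parse HTML with DOMDocument and collect the href of every <a> tag.
function dom_links($html) {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; the @ silences the parser's warnings.
    @$doc->loadHTML($html);
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}
```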

    Enjoy. Go spider now.
     
    mbreezy, Jul 12, 2008 IP
  4. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #4
    Sorry about that, I had the code lying around and just added in the notes. Except mine grabbed full sentences and altered words/punctuation... you can guess what that's for. lol
     
    mbreezy, Jul 12, 2008 IP
  5. Danltn

    Danltn Well-Known Member

    Messages:
    679
    Likes Received:
    36
    Best Answers:
    0
    Trophy Points:
    120
    #5
    Meh, doesn't bother me.

    Here's mine:

    <?php
    
    /**
     * Danltn | http://danltn.com/
     * No warranty is given to code used
     */
    
    function get_all_urls($url = '', $curl = false)
    {
        if (!$url)
        {
            /* If no URL is provided, throw a warning */
            trigger_error('You must provide a URL', E_USER_WARNING);
            return array();
        }
        if ($curl and function_exists('curl_setopt') and function_exists('curl_init') and function_exists('curl_exec'))
        {
            /* If we have cURL set to true AND it all checks out */
            $curl = curl_init($url);
            curl_setopt($curl, CURLOPT_TIMEOUT, 60);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
            curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
            /* Appear as Google */
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            $page = curl_exec($curl);
            curl_close($curl);
        }
        else
        {
            $page = @file_get_contents($url);
        }
        $preg = array();
        $base = array();
        $parsed = parse_url($url);
    
        preg_match_all("/\<a(\s*)href(\s*)=(\s*)\"(.*?)\"(.*?)\>(.*?)\<\/a\>/i", $page, $preg[0]);
        preg_match_all("/\<a(\s*)href(\s*)=(\s*)'(.*?)'(.*?)\>(.*?)\<\/a\>/i", $page, $preg[1]);
        preg_match("/\<base(\s*)href(\s*)=(\s*)\"(.*?)\"(\s*)\/\>/i", $page, $base);
    
        $href = array_merge($preg[0][4], $preg[1][4]);
        $base = (!empty($base[4])) ? $base[4] : ((!empty($parsed['user'])) ? "{$parsed['scheme']}://{$parsed['user']}:{$parsed['pass']}@{$parsed['host']}" : "{$parsed['scheme']}://{$parsed['host']}");
    
        for ($i = 0, $counthref = count($href); $i < $counthref; $i++)
        {
            if (substr($href[$i], 0, 1) == '/') $href[$i] = "{$base}{$href[$i]}";
            if (substr($href[$i], 0, 1) == '?' || substr($href[$i], 0, 1) == '#') $href[$i] = "{$url}{$href[$i]}";
            if (substr($href[$i], 0, 7) != "http://") $href[$i] = "{$base}/{$href[$i]}";
            while (strstr($href[$i], "//")) $href[$i] = str_replace("//", "/", $href[$i]);
            $href[$i] = str_replace("http:/", "http://", $href[$i]);
        }
        return array_unique($href);
    }
    
    print_r(get_all_urls('http://danltn.com'));
    
    ?>
    PHP:
    It probably works better than yours (no offence meant) because it will fix relative URLs (e.g. /index.php) to the full URL (http://danltn.com/index.php) automagically.
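    The relative-URL fix-up Dan describes boils down to something like this (a standalone sketch with a hypothetical resolve_link() helper, not his exact code):

```php
<?php
// Resolve a link found on a page against that page's URL.
// Covers the three cases Dan's loop handles: root-relative "/path",
// "?query" / "#fragment", and plain relative paths.
// Already-absolute links pass through untouched.
function resolve_link($href, $page_url) {
    if (substr($href, 0, 7) === 'http://' || substr($href, 0, 8) === 'https://') {
        return $href;
    }
    $p = parse_url($page_url);
    $base = $p['scheme'] . '://' . $p['host'];
    $first = substr($href, 0, 1);
    if ($first === '/') {
        return $base . $href;      // root-relative: prefix scheme + host
    }
    if ($first === '?' || $first === '#') {
        return $page_url . $href;  // relative to the exact page URL
    }
    // Plain relative path: crudely anchor at the site root,
    // the same simplification the original code makes.
    return $base . '/' . $href;
}
```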

    Dan
     
    Danltn, Jul 12, 2008 IP
  6. shallowink

    shallowink Well-Known Member

    Messages:
    1,218
    Likes Received:
    64
    Best Answers:
    2
    Trophy Points:
    150
    #6
    shallowink, Jul 12, 2008 IP
  7. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #7
    It's cool. Yours probably works better anyway. You use cURL, which is much more effective than my implosion. PLUS you use preg_match_all. I like yours much better too. I was just trying to help in a hurry. lol
     
    mbreezy, Jul 12, 2008 IP
  8. xchris

    xchris Peon

    Messages:
    111
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Thank you very much, especially Danltn! Btw mbreezy, are you a member of BHW? Your name sounds familiar
     
    xchris, Jul 12, 2008 IP
  9. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #9
    What is BHW?
     
    mbreezy, Jul 12, 2008 IP
  10. Danltn

    Danltn Well-Known Member

    Messages:
    679
    Likes Received:
    36
    Best Answers:
    0
    Trophy Points:
    120
    #10
    I'd guess that's a no then :p
     
    Danltn, Jul 12, 2008 IP