get links from pages (too complicated for me)

Discussion in 'PHP' started by xchris, Jul 12, 2008.

  1. #1
    Here is the thing. I have a text file with links from one domain, separated by newlines

    like this

    http://domain.com/link1.html
    http://domain.com/link2.html
    http://domain.com/link3.html

    Now I need to load all the links from the file and, for every link, extract all the links that are on that page

    Explanation, take 2:
    I need to take a link from the file -> get the content of that page -> preg_match_all all the links on it (internal and external) -> store them somewhere for later use -> go to the next link in the text file and do the same -> and so on. When the script has finished checking all the links from the file, it should write all the links it matched into another file.
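    In outline, the pipeline described above comes down to something like this (a rough sketch only; the filenames are placeholders):

```php
<?php
// Pull every href="..." out of one page's HTML.
function extract_hrefs($html) {
    preg_match_all('/href="([^"]+)"/i', $html, $m);
    return $m[1];
}

$in  = 'linklist.txt';     // one URL per line (placeholder name)
$out = 'found-links.txt';  // all matched links end up here (placeholder name)

if (is_readable($in)) {
    // file() turns the text file into an array, one line per entry.
    $links = file($in, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $found = array();
    foreach ($links as $link) {
        // Fetch the page behind each link; skip pages that fail to load.
        $page = @file_get_contents(trim($link));
        if ($page !== false) {
            $found = array_merge($found, extract_hrefs($page));
        }
    }
    // Write everything that was matched into the output file.
    file_put_contents($out, implode("\n", array_unique($found)) . "\n");
}
```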


    Uff. I hope you understand what I want.

    So now I'm a little confused, since I'm new to PHP. My question is: what functions do I use? Do I create an array from the text file and put it in a loop somehow? :confused::confused::confused: I'm really desperate. I've been working on this for two days and can't even find where to start, so some pointers would be nice, and a whole script would be like winning the lottery (you don't even need to try, I know you're busy)

    Thanks in advance for any help
     
    xchris, Jul 12, 2008 IP
  2. Danltn

    Danltn Well-Known Member

    Messages:
    679
    Likes Received:
    36
    Best Answers:
    0
    Trophy Points:
    120
    #2
    To put it bluntly, can you pay?

    I've already written a script for this, but never used it, or even sold it (yet).

    Dan
     
    Danltn, Jul 12, 2008 IP
  3. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #3
    
    //open file and put links in an array
    $filename = "linklist.txt";
    $content = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    
    //open up your output file once, before the loop
    //(opening it inside the loop with 'w' would wipe it on every pass)
    $fh = fopen('whateverfile.txt', 'w') or die("can't open file");
    
    //run through each link in the list
    foreach ($content as $link)
    {
    //open the link and put the page into $file
    	$file = implode('', file(trim($link)));
    //run my little function, see below for it, very, very dirty
    	$all_links = dig_all('a href="', '"', $file);
    //now all the links from the page are in the array $all_links
    
    //write each one to the file with a line break
    //(note the double quotes - '\n' in single quotes is a literal backslash-n)
    	foreach ($all_links as $found_link)
    	{
    		fwrite($fh, $found_link . "\n");
    	}
    }
    fclose($fh);
    
    
    //a little function i use often to pull from a page all instances
    //of text between two markers - it's a dirty substring finder
    function dig_all ($start_str, $end_str, $page, $limit = 0)
    {
    	$result = array();
    	$pieces = explode($start_str, $page);
    	$count = count($pieces);
    	for ($i = 1; $i < $count; $i++)
    	{
    		if ($limit > 0 && $i > $limit)
    			break;
    		$chunk = explode($end_str, $pieces[$i]);
    		if ($chunk[0] !== '')
    			$result[] = $chunk[0];
    	}
    	return $result;
    }
    
    
    Code (markup):
    That's it. I'm not good with regular expressions, so I have my own function that finds what I need. You may want to research them a bit and clean this up. There are better ways to recognize URLs on a page, much better ways.
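    For the record, one of those better ways is PHP's DOM extension, which actually parses the HTML instead of chopping strings, so it also catches single-quoted and unquoted href attributes that the marker-based approach above misses. A rough sketch:

```php
<?php
// Parse HTML with DOMDocument and collect the href of every <a> tag.
function dom_links($html) {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; the @ silences the parser's warnings.
    @$doc->loadHTML($html);
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}
```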

    Enjoy. Go spider now.
     
    mbreezy, Jul 12, 2008 IP
  4. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #4
    Sorry about that, I had the code lying around and just added in the notes. Except mine grabbed full sentences and altered words/punctuation... you can guess what that's for. lol
     
    mbreezy, Jul 12, 2008 IP
  5. Danltn

    Danltn Well-Known Member

    Messages:
    679
    Likes Received:
    36
    Best Answers:
    0
    Trophy Points:
    120
    #5
    Meh, doesn't bother me.

    Here's mine:

    <?php
    
    /**
     * Danltn | http://danltn.com/
     * No warranty is given to code used
     */
    
    function get_all_urls($url = '', $curl = false)
    {
        if (!$url)
        {
            /* If no URL is provided, throw a warning */
            trigger_error('You must provide a URL', E_USER_WARNING);
            return array();
        }
        if ($curl and function_exists('curl_setopt') and function_exists('curl_init') and function_exists('curl_exec'))
        {
            /* If we have cURL set to true AND it all checks out */
            $curl = curl_init($url);
            curl_setopt($curl, CURLOPT_TIMEOUT, 60);
            curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
            curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
            /* Appear as Google */
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            $page = curl_exec($curl);
            curl_close($curl);
        }
        else
        {
            $page = @file_get_contents($url);
        }
        $preg = array();
        $base = array();
        $parsed = parse_url($url);
    
        preg_match_all("/\<a(\s*)href(\s*)=(\s*)\"(.*?)\"(.*?)\>(.*?)\<\/a\>/i", $page, $preg[0]);
        preg_match_all("/\<a(\s*)href(\s*)=(\s*)'(.*?)'(.*?)\>(.*?)\<\/a\>/i", $page, $preg[1]);
        preg_match("/\<base(\s*)href(\s*)=(\s*)\"(.*?)\"(\s*)\/\>/i", $page, $base);
    
        $href = array_merge($preg[0][4], $preg[1][4]);
        $base = (!empty($base[4])) ? $base[4] : ((!empty($parsed['user'])) ? "{$parsed['scheme']}://{$parsed['user']}:{$parsed['pass']}@{$parsed['host']}" : "{$parsed['scheme']}://{$parsed['host']}");
    
        for ($i = 0, $counthref = count($href); $i < $counthref; $i++)
        {
            if (substr($href[$i], 0, 1) == '/') $href[$i] = "{$base}{$href[$i]}";
            if (substr($href[$i], 0, 1) == '?' || substr($href[$i], 0, 1) == '#') $href[$i] = "{$url}{$href[$i]}";
            if (substr($href[$i], 0, 7) != "http://") $href[$i] = "{$base}/{$href[$i]}";
            while (strstr($href[$i], "//")) $href[$i] = str_replace("//", "/", $href[$i]);
            $href[$i] = str_replace("http:/", "http://", $href[$i]);
        }
        return array_unique($href);
    }
    
    print_r(get_all_urls('http://danltn.com'));
    
    ?>
    PHP:
    It probably works better than yours (no offence meant) because it will fix relative URLs (e.g. /index.php) to the full URL (http://danltn.com/index.php) automagically.
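    The relative-URL fix-up Dan describes boils down to something like this (a standalone sketch with a hypothetical resolve_link() helper, not his exact code):

```php
<?php
// Resolve a link found on a page against that page's URL.
// Covers the three cases Dan's loop handles: root-relative "/path",
// "?query" / "#fragment", and plain relative paths.
// Already-absolute links pass through untouched.
function resolve_link($href, $page_url) {
    if (substr($href, 0, 7) === 'http://' || substr($href, 0, 8) === 'https://') {
        return $href;
    }
    $p = parse_url($page_url);
    $base = $p['scheme'] . '://' . $p['host'];
    $first = substr($href, 0, 1);
    if ($first === '/') {
        return $base . $href;      // root-relative: prefix scheme + host
    }
    if ($first === '?' || $first === '#') {
        return $page_url . $href;  // relative to the exact page URL
    }
    // Plain relative path: crudely anchor at the site root,
    // the same simplification the original code makes.
    return $base . '/' . $href;
}
```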

    Dan
     
    Danltn, Jul 12, 2008 IP
  6. shallowink

    shallowink Well-Known Member

    Messages:
    1,218
    Likes Received:
    64
    Best Answers:
    2
    Trophy Points:
    150
    #6
    shallowink, Jul 12, 2008 IP
  7. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #7
    It's cool. Yours probably works better anyway. You use cURL, which is much more effective than my implosion. PLUS you use preg_match_all. I like yours much better too. I was just trying to help in a hurry. lol
     
    mbreezy, Jul 12, 2008 IP
  8. xchris

    xchris Peon

    Messages:
    111
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #8
    Thank you very much, especially Danltn! Btw mbreezy, are you a member of BHW? Your name sounds familiar
     
    xchris, Jul 12, 2008 IP
  9. mbreezy

    mbreezy Active Member

    Messages:
    135
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    55
    #9
    What is BHW?
     
    mbreezy, Jul 12, 2008 IP
  10. Danltn

    Danltn Well-Known Member

    Messages:
    679
    Likes Received:
    36
    Best Answers:
    0
    Trophy Points:
    120
    #10
    I'd guess that's a no then :p
     
    Danltn, Jul 12, 2008 IP