Simple web crawler

Discussion in 'PHP' started by Make a perfect site, Jun 15, 2011.

  1. #1
    Hi all,

    I am just developing a very simple web spider/crawler. Here is the code:

    <?php
    // Seed URL to start crawling from.
    $seed = "http://www.akosblog.com";
    $html = file_get_contents($seed);
    echo "Page: " . $seed;

    // Grab every http:// URL that appears in the page source.
    preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $val) {
        echo "<br><font color=red>link:</font> " . $val[0] . "\r\n";
    }
    ?>
    This code just gets all the links from the selected page. Now I want to move on: I want the spider to follow those links, index each page it finds, and then follow the links on those pages as well.
    So how could I do that?

    Regards,
    Akos
     
    Make a perfect site, Jun 15, 2011 IP
  2. idlecool

    idlecool Greenhorn

    Messages:
    4
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    11
    #2
    Nice post. I don't know much PHP, so this is one more small step into it :)

    Btw, I think regexes are a pain. It's never easy to see where you went wrong, and it's an even bigger PITA when you need to modify old code later. There are some excellent parsing libraries available in different languages; I think PHP has one too, so let me know if you find it ;) I mostly code in Python and use `lxml`, which does its job awesomely :) You should look into it if you happen to have free time.
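
    For what it's worth, a quick look suggests PHP ships with a built-in DOMDocument class that can do the same link extraction without regex. I don't write PHP, so treat this as a rough, untested sketch:

    <?php
    // Rough sketch: parse the page with DOMDocument instead of regex.
    $html = file_get_contents("http://www.akosblog.com");

    $dom = new DOMDocument();
    // Suppress warnings about malformed real-world HTML.
    @$dom->loadHTML($html);

    // Walk every <a> tag and print its href attribute.
    foreach ($dom->getElementsByTagName('a') as $a) {
        echo $a->getAttribute('href') . "<br>\n";
    }
    ?>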
     
    idlecool, Jun 16, 2011 IP
  3. G3n3s!s

    G3n3s!s Active Member

    Messages:
    325
    Likes Received:
    4
    Best Answers:
    0
    Trophy Points:
    80
    #3
    Put each link in a database, and in the next phase select them all and repeat the same process for each link.
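
    Something like this as a minimal sketch, using an in-memory queue and the same regex from the first post (swap the arrays for a database table if you want it to survive between runs):

    <?php
    // Minimal sketch: breadth-first crawl with an in-memory queue.
    $queue    = array("http://www.akosblog.com"); // URLs still to fetch
    $visited  = array();                          // URLs already fetched
    $maxPages = 20;                               // safety limit for testing

    while (!empty($queue) && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already crawled this one
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue; // skip pages that fail to load
        }
        echo "Page: " . $url . "<br>\n";

        // Same regex as before, reused on every page we fetch.
        preg_match_all("/http:\/\/[^\"\s']+/", $html, $matches);
        foreach ($matches[0] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link; // follow this link in a later iteration
            }
        }
    }
    ?>

    You'll also want to limit it to your own domain or add a depth limit, otherwise it will wander off across the whole web.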
     
    G3n3s!s, Jun 16, 2011 IP