PHP crawler script

Discussion in 'PHP' started by WeedGrinch, Feb 15, 2008.

  1. #1
    I am looking for a very simple PHP script that crawls a site and grabs all of the links. I can't find any good scripts, and I've tried at least 10.

    If you could help me with this, or even guide me in the right direction (what functions, etc) I would appreciate it. Thanks.
     
    WeedGrinch, Feb 15, 2008 IP
  2. zerxer

    zerxer Peon

    Messages:
    368
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    0
    #2
    You want simple? Okay, here's a really simple one that I just whipped up for ya. :)

    
    <?php
      $original_file = file_get_contents("http://www.domain.com");
      $stripped_file = strip_tags($original_file, "<a>");
      preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
    
      //DEBUGGING
    
      //$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
      //$matches[1] now contains only the HREFs in the A tags; ex: link
    
      header("Content-type: text/plain"); //Set the content type to plain text so the print below is easy to read!
      print_r($matches); //View the array to see if it worked
    ?>
    You would remove everything after //DEBUGGING when actually using it though. It's just for you to see how it works if you put it in a PHP file by itself for testing.

    Only took 3 lines.. not counting debugging.
     
    zerxer, Feb 15, 2008 IP
  3. hasan_889

    hasan_889 Banned

    Messages:
    303
    Likes Received:
    7
    Best Answers:
    0
    Trophy Points:
    0
    #3
    It's just showing 1 page ... it's not crawling deep!
     
    hasan_889, Feb 15, 2008 IP
  4. zerxer

    zerxer Peon

    Messages:
    368
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Well, I only made it to rip the links off of one page. You can put those 3 lines in a function that returns the $matches[1] array, then loop through that array and call the function again on each link so that it keeps crawling, roughly like the sketch below.
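
    A rough sketch of that looping idea (not the original example; the function names, the depth limit, and the visited list are just one way to keep it from running forever):

    <?php
      // Hypothetical sketch of the crawling loop described above.
      // get_links() is just the 3-line extractor wrapped in a function.
      function get_links($url)
      {
        $original_file = @file_get_contents($url);
        if ($original_file === false) {
          return array(); // page could not be fetched
        }
        $stripped_file = strip_tags($original_file, "<a>");
        preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
        return $matches[1]; // only the HREFs
      }

      // crawl() follows each link it finds, down to $depth levels, and
      // remembers visited URLs so it doesn't loop back on itself.
      function crawl($url, $depth, &$visited)
      {
        if ($depth <= 0 || isset($visited[$url])) {
          return;
        }
        $visited[$url] = true;

        foreach (get_links($url) as $link) {
          if (strpos($link, "http") === 0) { // only follow absolute links in this sketch
            crawl($link, $depth - 1, $visited);
          }
        }
      }

      $visited = array();
      crawl("http://www.domain.com", 2, $visited); // a depth of 2 keeps it from taking forever
      header("Content-type: text/plain");
      print_r(array_keys($visited)); // every page the crawler reached
    ?>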

    If you'd like to see an example of what those 3 lines do, go to php.sitexero.net/?preview=link_crawler
     
    zerxer, Feb 15, 2008 IP
  5. redhits

    redhits Notable Member

    Messages:
    3,023
    Likes Received:
    277
    Best Answers:
    0
    Trophy Points:
    255
    #5
    Learn some PHP and create your own. :)
     
    redhits, Feb 15, 2008 IP
  6. zerxer

    zerxer Peon

    Messages:
    368
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    0
    #6
    Yeah... I did the hard part. Looping through them should be a breeze. Anyway, I'm still pretty bored right now and don't want to start on any of my major projects, so I guess I can quickly edit the example to dig X levels deep.

    EDIT: Never mind... I was over halfway done when I decided that doing that was very unstable. It would take forever to load. I think my example of how to dig one page is enough. I made a few modifications to it, though. Check php.sitexero.net/?code=link_crawler
     
    zerxer, Feb 15, 2008 IP
  7. alimkb

    alimkb Member

    Messages:
    27
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    36
    #7
    I think "PHP Crawler" would be useful. It's a simple PHP-based web crawler:
    sourceforge.net/projects/php-crawler/
     
    alimkb, Feb 16, 2008 IP
  8. Estevan

    Estevan Peon

    Messages:
    120
    Likes Received:
    8
    Best Answers:
    1
    Trophy Points:
    0
    #8
    Hello,
    here is a simple example:

    <?php
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.urlyourstart.com");
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout after 30 seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page as a string instead of printing it
    $result = curl_exec($ch);
    curl_close($ch);

    // Search the results from the starting site
    if ($result)
    {
        // I only look at top-level "http://www." domains; change this regex for your usage
        preg_match_all('/<a href="(http:\/\/www\.[^0-9].+?)"/', $result, $output, PREG_SET_ORDER);

        foreach ($output as $item)
        {
            // All the links are displayed here
            print_r($item);

            // Now you would add them to your database and loop so the engine never stops
        }
    }
    ?>

    Maybe this helps you.
     
    Estevan, Feb 16, 2008 IP
  9. WeedGrinch

    WeedGrinch Active Member

    Messages:
    1,236
    Likes Received:
    73
    Best Answers:
    0
    Trophy Points:
    90
    #9
    I found something. Thanks to everyone who replied!
     
    WeedGrinch, Feb 16, 2008 IP
  10. indianseo

    indianseo Peon

    Messages:
    208
    Likes Received:
    11
    Best Answers:
    0
    Trophy Points:
    0
    #10
    Could you please share what you found and how well it works? Thank you.
     
    indianseo, Feb 19, 2008 IP
  11. CATTechnologies

    CATTechnologies Guest

    Messages:
    13
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #11
    Use the cURL functions to get the data, then manipulate it as you need to extract the links.
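
    For instance, a minimal sketch of that approach (fetch the page with cURL, then pull the links out with DOMDocument; the URL is only a placeholder):

    <?php
      $ch = curl_init("http://www.example.com"); // placeholder, use the site you want to crawl
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // get the HTML back as a string
      curl_setopt($ch, CURLOPT_TIMEOUT, 30);
      $html = curl_exec($ch);
      curl_close($ch);

      if ($html) {
        libxml_use_internal_errors(true); // real-world HTML is rarely valid, so silence the warnings
        $dom = new DOMDocument();
        $dom->loadHTML($html);

        foreach ($dom->getElementsByTagName('a') as $anchor) {
          echo $anchor->getAttribute('href') . "\n"; // each link's href
        }
      }
    ?>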

    For more information: cattechnologies.com
     
    CATTechnologies, Feb 21, 2008 IP
  12. kaviarasankk

    kaviarasankk Peon

    Messages:
    17
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #12
    Wow, it's really working, super!
    zerxer... great!
     
    kaviarasankk, Jul 14, 2010 IP
  13. Deacalion

    Deacalion Peon

    Messages:
    438
    Likes Received:
    11
    Best Answers:
    0
    Trophy Points:
    0
    #13
    ugh... seriously old thread. Necrophilia dude.
     
    Deacalion, Jul 14, 2010 IP
  14. kaviarasankk

    kaviarasankk Peon

    Messages:
    17
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #14
    Huh, yeah, you're right.
    But it's...
     
    kaviarasankk, Jul 14, 2010 IP
  15. themullet

    themullet Member

    Messages:
    110
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    26
    #15
    A nice bit of regex.
     
    themullet, Jul 14, 2010 IP
  16. priyanka-mepco

    priyanka-mepco Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #16
    We are developing a search engine. For that we need code for a web crawler, or something similar, that fetches websites automatically without us entering them into the database manually.
     
    priyanka-mepco, Feb 28, 2011 IP
  17. priyanka-mepco

    priyanka-mepco Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #17
    Please help us as soon as possible, if you know how.
     
    priyanka-mepco, Feb 28, 2011 IP
  18. aioarticles

    aioarticles Peon

    Messages:
    6
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #18
    Can you use simple_html_dom.php?
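
    (If you mean the Simple HTML DOM Parser library, its usual link-grabbing pattern looks roughly like this, assuming simple_html_dom.php has been downloaded next to the script and the URL is just a placeholder:)

    <?php
      include 'simple_html_dom.php'; // the library file itself, downloaded separately

      $html = file_get_html('http://www.example.com'); // placeholder URL

      if ($html) {
        foreach ($html->find('a') as $anchor) {
          echo $anchor->href . "\n"; // the href of each <a> tag
        }
        $html->clear(); // free the parser's memory
      }
    ?>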
     
    aioarticles, Aug 25, 2011 IP
  19. sojic

    sojic Active Member

    Messages:
    133
    Likes Received:
    3
    Best Answers:
    0
    Trophy Points:
    90
    #19
    Avoid using simple_html_dom for crawling; it takes a lot of memory and the script crashes. A custom crawler using regex is best.
     
    sojic, Aug 25, 2011 IP
  20. Balajink

    Balajink Peon

    Messages:
    1
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #20
    Hi all,

    I need a simple crawl script in PHP that fetches the category, image, description, keywords, title, meta, price, and MRP of an ecommerce website and stores them in a MySQL database. So please reply to me.
     
    Balajink, Dec 17, 2011 IP