PHP Crawl From Website

ruby90 Peon

#1

Can anyone help me crawl links like these:

<a href="http://www.example1.com">contents1</a> 
<a href="http://www.example2.com">contents2</a> 
<a href="http://www.example3.com">contents3</a> 
<a href="http://www.example4.com">contents4</a> 
<a href="http://www.example5.com">contents5</a> 

and insert them into a database, but only when the record does not already exist? Is that possible? I'm hoping someone can help.

ruby90, Apr 19, 2008 IP

nation-x Peon

#2

Just for clarification: you want to pull the links from a page and store the data in MySQL?

To pull the data from the page, you could use an HTTP client class called Snoopy:
http://sourceforge.net/projects/snoopy/

There are examples on the web for grabbing the links from a page. Saving the data to MySQL is easy.


<?php
/* Database table SQL
CREATE TABLE `links` (
  `id` int(11) NOT NULL auto_increment,
  `url` varchar(255) NOT NULL,
  `content` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
*/
include("snoopy.class.php");

$dbhost = 'localhost';
$dbname = 'database_name';
$dbuser = 'username';
$dbpass = 'password';

$link = mysql_connect($dbhost, $dbuser, $dbpass);
if (!$link) {
    die('Could not connect: ' . mysql_error());
}

$db = mysql_select_db($dbname, $link);

if (!$db) {
    die ('Can\'t use '.$dbname.' : ' . mysql_error());
}

$snoopy = new Snoopy;

// need a proxy?:
//$snoopy->proxy_host = "my.proxy.host";
//$snoopy->proxy_port = "8080";

// set browser and referer:
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
$snoopy->referer = "http://www.google.com/";

// set a raw header:
$snoopy->rawheaders["Pragma"] = "no-cache";

// set some internal variables:
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = true;

// fetch the text of the website www.google.com:
if($snoopy->fetchtext("http://www.google.com")){

        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        if(preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
            foreach($matches as $match) {
                # $match[2] = link address
                # $match[3] = link text
                // insert the link and text into the database. If the url already exists then basically do nothing
                $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                                     mysql_real_escape_string($match[2]),
                                     mysql_real_escape_string($match[3]),
                                     mysql_real_escape_string($match[2]));
                $rs = mysql_query($sql);
            }
            echo "Operation has completed.";
        }
}
else {
    print "Snoopy: error while fetching document: ".$snoopy->error."\n";
}
?>

nation-x, Apr 19, 2008 IP
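
A note for anyone trying this on a current PHP version: the mysql_* functions were removed in PHP 7.0, so the database part needs mysqli or PDO instead. Below is a minimal sketch of the same "insert only if the URL is new" idea using PDO with a prepared statement. save_link() is a hypothetical helper name, not something from Snoopy or the script above, and the demo uses an in-memory SQLite database only so that it runs without a server. Instead of ON DUPLICATE KEY UPDATE it does a plain INSERT and treats a constraint violation on the unique `url` column as "already stored", which should behave the same way on MySQL.

```php
<?php
// Sketch only: save_link() is a hypothetical helper, not part of Snoopy.
// It relies on a UNIQUE index on `url`, as in the `links` table definition.

/**
 * Insert a link. Returns true if a new row was stored,
 * false if the insert was rejected by a constraint (URL already present).
 */
function save_link(PDO $pdo, string $url, string $content): bool
{
    $stmt = $pdo->prepare('INSERT INTO links (url, content) VALUES (:url, :content)');
    try {
        // Bound parameters take the place of manual escaping.
        $stmt->execute([':url' => $url, ':content' => $content]);
        return true;
    } catch (PDOException $e) {
        // SQLSTATE class 23 means an integrity constraint violation.
        if (strpos((string) $e->getCode(), '23') === 0) {
            return false;
        }
        throw $e; // anything else is a real error
    }
}

// Demo on in-memory SQLite (an assumption made so the sketch is self-contained).
// For MySQL, a DSN such as 'mysql:host=localhost;dbname=database_name' would be used.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url VARCHAR(255) NOT NULL UNIQUE,
    content VARCHAR(255) NOT NULL
)');

$first  = save_link($pdo, 'http://www.example1.com', 'contents1'); // new URL
$second = save_link($pdo, 'http://www.example1.com', 'contents1'); // same URL again
$count  = (int) $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn();

var_dump($first, $second, $count);
```

The return value makes it easy to count how many links were actually new during a crawl.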

ruby90 likes this.

ruby90 Peon

#3

Thanks nation-x

rep added

ruby90, Apr 19, 2008 IP

nation-x Peon

#4

You might have to make some changes to it. I didn't test it; I just modified this example I found on DZone to do what you wanted, so there might be errors. I don't debug free scripts.

http://snippets.dzone.com/posts/show/2007

nation-x, Apr 19, 2008 IP

nation-x Peon

#5

OK, so I decided to test it, and it wasn't working, so I fixed it a little. You will still need to modify the script to expand the links correctly.


<?php
/* Database table SQL
CREATE TABLE `links` (
  `id` int(11) NOT NULL auto_increment,
  `url` varchar(255) NOT NULL,
  `content` varchar(255) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
*/
include("snoopy.class.php");

$dbhost = 'localhost';
$dbname = 'database_name';
$dbuser = 'username';
$dbpass = 'password';

$link = mysql_connect($dbhost, $dbuser, $dbpass);
if (!$link) {
    die('Could not connect: ' . mysql_error());
}

$db = mysql_select_db($dbname, $link);

if (!$db) {
    die ('Can\'t use '.$dbname.' : ' . mysql_error());
}

$snoopy = new Snoopy;

// need a proxy?:
//$snoopy->proxy_host = "my.proxy.host";
//$snoopy->proxy_port = "8080";

// set browser and referer:
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
$snoopy->referer = "http://www.yahoo.com/";

// set a raw header:
$snoopy->rawheaders["Pragma"] = "no-cache";

// set some internal variables:
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = true;

// fetch the raw HTML of www.yahoo.com; fetch() keeps the markup, whereas
// fetchtext() strips the tags that the regex below relies on:
if($snoopy->fetch("http://www.yahoo.com")){

        // capture the href value and the link text of each <a> tag
        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        if(preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
            foreach($matches as $match) {
                # $match[2] = link address
                # $match[3] = link text
                // insert the link and text into the database. If the url already exists,
                // the unique key on `url` turns the statement into a no-op update
                $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                                     mysql_real_escape_string($match[2]),
                                     mysql_real_escape_string($match[3]),
                                     mysql_real_escape_string($match[2]));
                $rs = mysql_query($sql);
                if (!$rs) {
                    // report failed inserts instead of silently skipping them
                    print "Insert failed: " . mysql_error() . "\n";
                }
            }
            echo "Operation has completed.";
        }
}
else {
    print "Snoopy: error while fetching document: ".$snoopy->error."\n";
}
?>

nation-x, Apr 19, 2008 IP
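
On the point about expanding the links: the regex captures whatever is in the href, so relative links such as /about.html end up in the table as they are. The following is a rough sketch of one way to turn them into absolute URLs before saving, and it also swaps the regex for PHP's DOMDocument parser, which tends to cope better with messy HTML. expand_url() and extract_links() are hypothetical helper names, not Snoopy functions, the base URL in the demo is made up, and the resolution logic covers the common cases only, not every rule of RFC 3986.

```php
<?php
// Sketch only: expand_url() and extract_links() are hypothetical helpers.

/** Resolve $href against the absolute URL $base (common cases only). */
function expand_url(string $base, string $href): string
{
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;                          // already has a scheme
    }
    $b      = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;          // protocol-relative link
    }
    $root     = $scheme . '://' . ($b['host'] ?? '') . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';
    if ($href === '' || $href[0] === '#') {    // empty or fragment-only link
        return $root . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $href;
    }
    if ($href[0] === '?') {                    // query-only link
        return $root . $basePath . $href;
    }
    // Root-relative links replace the whole path; others replace the last segment.
    $path = $href[0] === '/' ? $href : substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;

    // Split off any query/fragment, then collapse "." and ".." segments.
    preg_match('~^([^?#]*)(.*)$~s', $path, $m);
    $parts    = explode('/', $m[1]);
    $segments = [];
    foreach ($parts as $seg) {
        if ($seg === '..') {
            if (count($segments) > 1) {
                array_pop($segments);          // never climb above the root
            }
        } elseif ($seg !== '.') {
            $segments[] = $seg;
        }
    }
    if (in_array(end($parts), ['.', '..'], true)) {
        $segments[] = '';                      // keep the trailing slash
    }
    return $root . implode('/', $segments) . $m[2];
}

/** Return an array of absolute URL => link text for every <a href> in $html. */
function extract_links(string $html, string $base): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc      = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // real pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $url = expand_url($base, trim($a->getAttribute('href')));
            $links[$url] = trim($a->textContent);   // keyed by URL, so repeats collapse
        }
    }
    return $links;
}

// Demo with a made-up base URL.
$html = '<a href="http://www.example1.com">contents1</a>
<a href="/about.html">About</a>
<a href="../news/today.html">Today</a>';
$links = extract_links($html, 'http://www.example.com/dir/page.html');
print_r($links);
```

With Snoopy, $snoopy->results from fetch() would be passed as $html and the fetched address as $base.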

ruby90 Peon

#6

Thanks again, nation-x. You're good.

ruby90, Apr 19, 2008 IP

nation-x Peon

#7

This is the test I ran:

http://www.phpfoundry.com/test.php?url=http://www.yahoo.com

nation-x, Apr 19, 2008 IP

Thapar Peon

#8

Sir, as part of my engineering project, I have to design a tourism website that gets all its data by crawling various sites and maintains a database of it. There is no constraint on the language to be used, but I would prefer either PHP or Perl.
Please suggest which of the two I should use (that is, which does the job faster and is easier to learn), and mention some good resources where I can read up on this.

Regards -
DT

Thapar, Oct 18, 2009 IP

itrana123 Peon

#9

Thanks, nice work. It really helps.

itrana123, Apr 20, 2010 IP
