Can anyone help me crawl links like these: <a href="http://www.example1.com">contents1</a><br /> <a href="http://www.example2.com">contents2</a><br /> <a href="http://www.example3.com">contents3</a><br /> <a href="http://www.example4.com">contents4</a><br /> <a href="http://www.example5.com">contents5</a><br /> and insert them into a database, but only when the record does not already exist? Is that possible? Hoping someone can help.
Just for clarification: you want to pull the links from a page and store the data in MySQL? For pulling the data from the page you COULD use the HTTP client class called Snoopy: http://sourceforge.net/projects/snoopy/ There are examples on the web for grabbing the links from a page. Saving the data to MySQL is easy.

<?php
/*
Database table SQL

CREATE TABLE `links` (
  `id` int(11) NOT NULL auto_increment,
  `url` varchar(255) NOT NULL,
  `content` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
*/

include("snoopy.class.php");

$dbhost = 'localhost';
$dbname = 'database_name';
$dbuser = 'username';
$dbpass = 'password';

$link = mysql_connect($dbhost, $dbuser, $dbpass);
if (!$link) {
    die('Could not connect: ' . mysql_error());
}

$db = mysql_select_db($dbname, $link);
if (!$db) {
    die('Can\'t use ' . $dbname . ' : ' . mysql_error());
}

$snoopy = new Snoopy;

// need a proxy?
//$snoopy->proxy_host = "my.proxy.host";
//$snoopy->proxy_port = "8080";

// set browser and referer:
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
$snoopy->referer = "http://www.google.com/";

// set a raw header:
$snoopy->rawheaders["Pragma"] = "no-cache";

// set some internal variables:
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = true;

// fetch the text of the website www.google.com:
if ($snoopy->fetchtext("http://www.google.com")) {
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if (preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            # $match[2] = link address
            # $match[3] = link text

            // Insert the link and text into the database.
            // If the url already exists, ON DUPLICATE KEY UPDATE makes this effectively a no-op.
            $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                mysql_real_escape_string($match[2]),
                mysql_real_escape_string($match[3]),
                mysql_real_escape_string($match[2]));
            $rs = mysql_query($sql);
        }
        echo "Operation has completed.";
    }
} else {
    print "Snoopy: error while fetching document: " . $snoopy->error . "\n";
}
?>
You might have to make some changes to it. I didn't test it; I just modified this example I found on DZone to do what you wanted, so there might be errors. I don't debug free scripts. http://snippets.dzone.com/posts/show/2007
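A side note on the "only insert if it's not an existing record" part: the mysql_* functions were removed in PHP 7.0, so on a current PHP the same idea needs PDO or mysqli. Below is a minimal sketch of the technique (a UNIQUE column plus an insert that skips duplicates, using a prepared statement). It is not a drop-in replacement for the script: it assumes the pdo_sqlite extension is available and uses an in-memory SQLite database purely so it runs without a server, and save_link() is a made-up helper name. On MySQL the matching statement would be INSERT IGNORE INTO (or the ON DUPLICATE KEY UPDATE form used in the post).

```php
<?php
// Sketch only: an in-memory SQLite database stands in for the MySQL table.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Same columns as the `links` table in the post.
// The UNIQUE constraint on url is what performs the duplicate check.
$pdo->exec('CREATE TABLE links (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    url     VARCHAR(255) NOT NULL UNIQUE,
    content VARCHAR(255) NOT NULL
)');

// Hypothetical helper: returns true if a new row was stored,
// false if the url was already present and the insert was skipped.
function save_link(PDO $pdo, string $url, string $content): bool
{
    // SQLite spelling; on MySQL use "INSERT IGNORE INTO links ..." instead.
    $stmt = $pdo->prepare('INSERT OR IGNORE INTO links (url, content) VALUES (?, ?)');
    $stmt->execute([$url, $content]);
    return $stmt->rowCount() === 1;
}

$first  = save_link($pdo, 'http://www.example1.com', 'contents1');
$second = save_link($pdo, 'http://www.example1.com', 'contents1'); // same url again
$third  = save_link($pdo, 'http://www.example2.com', 'contents2');

// Count what actually ended up in the table.
$total = (int) $pdo->query('SELECT COUNT(*) FROM links')->fetchColumn();
```

The bound parameters also replace the mysql_real_escape_string() calls, since the values never get spliced into the SQL string.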
OK, so I decided to test it, and it wasn't working, so I fixed it, a little. You will still need to modify the script to expand the links correctly.

<?php
/*
Database table SQL

CREATE TABLE `links` (
  `id` int(11) NOT NULL auto_increment,
  `url` varchar(255) NOT NULL,
  `content` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
*/

include("snoopy.class.php");

$dbhost = 'localhost';
$dbname = 'database_name';
$dbuser = 'username';
$dbpass = 'password';

$link = mysql_connect($dbhost, $dbuser, $dbpass);
if (!$link) {
    die('Could not connect: ' . mysql_error());
}

$db = mysql_select_db($dbname, $link);
if (!$db) {
    die('Can\'t use ' . $dbname . ' : ' . mysql_error());
}

$snoopy = new Snoopy;

// need a proxy?
//$snoopy->proxy_host = "my.proxy.host";
//$snoopy->proxy_port = "8080";

// set browser and referer:
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
$snoopy->referer = "http://www.yahoo.com/";

// set a raw header:
$snoopy->rawheaders["Pragma"] = "no-cache";

// set some internal variables:
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = true;

// fetch the raw HTML of the website www.yahoo.com:
if ($snoopy->fetch("http://www.yahoo.com")) {
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if (preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            # $match[2] = link address
            # $match[3] = link text

            // Insert the link and text into the database.
            // If the url already exists, ON DUPLICATE KEY UPDATE makes this effectively a no-op.
            $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                mysql_real_escape_string($match[2]),
                mysql_real_escape_string($match[3]),
                mysql_real_escape_string($match[2]));
            $rs = mysql_query($sql);
        }
        echo "Operation has completed.";
    }
} else {
    print "Snoopy: error while fetching document: " . $snoopy->error . "\n";
}
?>
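On the "expand the links" step: the href values captured by the regex can be relative (for example "page2.html" or "/about"), so they need to be resolved against the page URL before they are stored. Here is a rough sketch of one way to do that. resolve_url() is a hypothetical helper, not something Snoopy provides, and it is a simplification of the RFC 3986 rules (for instance, a fragment-only link drops the base URL's query string), so treat it as a starting point.

```php
<?php
// Hypothetical helper (not part of Snoopy): resolve $href against $base.
function resolve_url(string $base, string $href): string
{
    // Already absolute: starts with a scheme such as "http:" or "mailto:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link such as "//cdn.example.com/x".
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }

    // Split off any query string or fragment so only the path is normalised.
    $cut    = strcspn($href, '?#');
    $suffix = substr($href, $cut);
    $path   = substr($href, 0, $cut);

    $basePath = $parts['path'] ?? '/';
    if ($path === '') {
        // Query-only or fragment-only link: keep the base path.
        return $root . $basePath . $suffix;
    }
    if ($path[0] !== '/') {
        // Relative path: resolve against the directory of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    // Preserve a trailing slash if the resolved path had one.
    $trailing = (substr($path, -1) === '/' && $out) ? '/' : '';

    return $root . '/' . implode('/', $out) . $trailing . $suffix;
}
```

In the loop from the script, that would mean passing something like resolve_url("http://www.yahoo.com", $match[2]) to the insert instead of the raw $match[2].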
Sir, as part of my engineering project I have to design a tourism website that gets all its data by crawling various sites and maintains a database of it. There is no constraint on the language to be used, but I would prefer either PHP or Perl. Please suggest which of the two I should use (that is, which does the job faster and is easier to pick up), and do mention some good resources where I can learn more about this. Regards - DT