PHP Crawl From Website

Discussion in 'PHP' started by ruby90, Apr 19, 2008.

  1. #1
    Can anyone help me crawl this:

    <a href="http://www.example1.com">contents1</a><br />
    <a href="http://www.example2.com">contents2</a><br />
    <a href="http://www.example3.com">contents3</a><br />
    <a href="http://www.example4.com">contents4</a><br />
    <a href="http://www.example5.com">contents5</a><br />

    and insert it into a database, but only if the record does not already exist? Is that possible? Hoping someone can help :)
     
    ruby90, Apr 19, 2008
  2. nation-x

    nation-x Peon (Messages: 59, Likes Received: 1, Best Answers: 0, Trophy Points: 0)
    #2
    Just to clarify: you want to pull the links from a page and store the data in MySQL?

    To pull the data from the page, you could use an HTTP client class called Snoopy:
    http://sourceforge.net/projects/snoopy/

    There are examples on the web for grabbing the links from a page. Saving the data to MySQL is easy:

    
    <?php
    /* Database table SQL
    CREATE TABLE `links` (
      `id` int(11) NOT NULL auto_increment,
      `url` varchar(255) NOT NULL,
      `content` varchar(255) NOT NULL,
      PRIMARY KEY  (`id`),
      UNIQUE KEY `url` (`url`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
    */
    include("snoopy.class.php");
    
    $dbhost = 'localhost';
    $dbname = 'database_name';
    $dbuser = 'username';
    $dbpass = 'password';
    
    $link = mysql_connect($dbhost, $dbuser, $dbpass);
    if (!$link) {
        die('Could not connect: ' . mysql_error());
    }
    
    $db = mysql_select_db($dbname, $link);
    
    if (!$db) {
        die ('Can\'t use '.$dbname.' : ' . mysql_error());
    }
    
    $snoopy = new Snoopy;
    
    // need a proxy?:
    //$snoopy->proxy_host = "my.proxy.host";
    //$snoopy->proxy_port = "8080";
    
    // set browser and referer:
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
    $snoopy->referer = "http://www.google.com/";
    
    // set a raw header:
    $snoopy->rawheaders["Pragma"] = "no-cache";
    
    // set some internal variables:
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = true;
    
    // fetch the text of the website www.google.com:
    if ($snoopy->fetchtext("http://www.google.com")) {
        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        if (preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                # $match[2] = link address
                # $match[3] = link text
                // insert the link and text into the database. If the url already exists then basically do nothing
                $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                               mysql_real_escape_string($match[2]),
                               mysql_real_escape_string($match[3]),
                               mysql_real_escape_string($match[2]));
                $rs = mysql_query($sql);
            }
            echo "Operation has completed.";
        }
    }
    else {
        print "Snoopy: error while fetching document: ".$snoopy->error."\n";
    }
    ?>
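
    The regex in the script above is sensitive to attribute order and quoting. As a sketch of an alternative (not part of the original script), PHP's bundled DOM extension can pull the same href/text pairs. extractLinks() is a made-up helper name, and the sample markup is borrowed from the first post.

```php
<?php
// Sketch: collect (url, content) pairs from <a> tags with the DOM extension.
// extractLinks() is a hypothetical helper, not a Snoopy method.
function extractLinks(string $html): array
{
    $links = [];
    if (trim($html) === '') {
        return $links; // loadHTML() rejects empty input on PHP 8
    }

    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '') {
            continue; // named anchors and empty hrefs carry no link
        }
        $links[] = ['url' => $href, 'content' => trim($anchor->textContent)];
    }
    return $links;
}

$html = '<a href="http://www.example1.com">contents1</a><br />'
      . '<a href="http://www.example2.com">contents2</a><br />';

foreach (extractLinks($html) as $link) {
    echo $link['url'], ' => ', $link['content'], "\n";
}
```

    Each returned pair can be passed to the same INSERT that the script builds from $match[2] and $match[3].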
    
    nation-x, Apr 19, 2008
  3. ruby90

    ruby90 Peon (Messages: 721, Likes Received: 7, Best Answers: 0, Trophy Points: 0)
    #3
    Thanks nation-x :) rep added

     
    ruby90, Apr 19, 2008
  4. nation-x

    #4
    You might have to make some changes to it. I didn't test it; I just modified this example I found on dzone to do what you wanted, so there might be errors. I don't debug free scripts :)

    http://snippets.dzone.com/posts/show/2007
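
    One change that is needed on any current PHP: the mysql_* functions used in the script were removed in PHP 7.0. Below is a minimal sketch of the "insert only if the url is new" step with PDO and a prepared statement, relying on the UNIQUE key on url from the CREATE TABLE above. insertLinkIfNew() is a hypothetical helper name, and the connection values are the same placeholders as in the script.

```php
<?php
// Sketch: store a link only if its url is not in the table yet.
// Assumes the links table has a UNIQUE key on url and that the PDO handle
// was created with PDO::ERRMODE_EXCEPTION. insertLinkIfNew() is a made-up name.
function insertLinkIfNew(PDO $pdo, string $url, string $content): bool
{
    $stmt = $pdo->prepare('INSERT INTO links (url, content) VALUES (:url, :content)');
    try {
        $stmt->execute([':url' => $url, ':content' => $content]);
        return true;  // new row stored
    } catch (PDOException $e) {
        // SQLSTATE class 23 = integrity constraint violation (here: duplicate url).
        if (strpos((string) $e->getCode(), '23') === 0) {
            return false;  // already present, nothing to do
        }
        throw $e;  // anything else is a real error
    }
}

// Usage against MySQL (placeholder credentials, so left commented out):
// $pdo = new PDO('mysql:host=localhost;dbname=database_name;charset=utf8',
//                'username', 'password',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// insertLinkIfNew($pdo, 'http://www.example1.com', 'contents1');
```

    Catching the duplicate-key error keeps the helper independent of the SQL dialect. On MySQL, INSERT IGNORE or the ON DUPLICATE KEY UPDATE form from the script does the same job inside the query.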
     
    nation-x, Apr 19, 2008
  5. nation-x

    #5
    OK, so I decided to test it, and it wasn't working, so I fixed it a little. You will still need to modify the script to expand the links correctly.

    
    <?php
    /* Database table SQL
    CREATE TABLE `links` (
      `id` int(11) NOT NULL auto_increment,
      `url` varchar(255) NOT NULL,
      `content` varchar(255) NOT NULL,
      PRIMARY KEY  (`id`),
      UNIQUE KEY `url` (`url`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
    */
    include("snoopy.class.php");
    
    $dbhost = 'localhost';
    $dbname = 'database_name';
    $dbuser = 'username';
    $dbpass = 'password';
    
    $link = mysql_connect($dbhost, $dbuser, $dbpass);
    if (!$link) {
        die('Could not connect: ' . mysql_error());
    }
    
    $db = mysql_select_db($dbname, $link);
    
    if (!$db) {
        die ('Can\'t use '.$dbname.' : ' . mysql_error());
    }
    
    $snoopy = new Snoopy;
    
    // need a proxy?:
    //$snoopy->proxy_host = "my.proxy.host";
    //$snoopy->proxy_port = "8080";
    
    // set browser and referer:
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
    $snoopy->referer = "http://www.yahoo.com/";
    
    // set a raw header:
    $snoopy->rawheaders["Pragma"] = "no-cache";
    
    // set some internal variables:
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = true;
    
    // fetch the raw HTML of www.yahoo.com (fetchtext() strips the tags, so the regex finds no links):
    if ($snoopy->fetch("http://www.yahoo.com")) {
        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        if (preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                # $match[2] = link address
                # $match[3] = link text
                // insert the link and text into the database. If the url already exists then basically do nothing
                $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                               mysql_real_escape_string($match[2]),
                               mysql_real_escape_string($match[3]),
                               mysql_real_escape_string($match[2]));
                // report failed inserts instead of silently discarding the result
                if (!mysql_query($sql)) {
                    print "Insert failed: " . mysql_error() . "\n";
                }
            }
            echo "Operation has completed.";
        }
    }
    else {
        print "Snoopy: error while fetching document: ".$snoopy->error."\n";
    }
    ?>
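
    On expanding the links: the raw HTML can contain relative href values such as /sports/ or ../img/logo.gif. Here is a minimal sketch of turning those into absolute URLs before the INSERT, assuming that is the kind of expansion meant here. resolveUrl() is a hypothetical helper, not a Snoopy method, and it covers the common cases rather than the full RFC 3986 algorithm.

```php
<?php
// Sketch only: resolveUrl() is a hypothetical helper, not part of Snoopy.
function resolveUrl(string $base, string $href): string
{
    $href = trim($href);

    // Already absolute (http:, https:, mailto:, ...): keep as-is.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $href; // unusable base, leave the link untouched
    }

    $origin    = $parts['scheme'] . '://' . $parts['host']
               . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath  = $parts['path'] ?? '/';
    $baseQuery = isset($parts['query']) ? '?' . $parts['query'] : '';

    if ($href === '') {
        return $origin . $basePath . $baseQuery;          // the page itself
    }
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;            // protocol-relative
    }
    if ($href[0] === '#') {
        return $origin . $basePath . $baseQuery . $href;  // fragment only
    }
    if ($href[0] === '?') {
        return $origin . $basePath . $href;               // query only
    }

    // Root-relative paths replace the base path; others start in its directory.
    $path = $href[0] === '/'
        ? $href
        : substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;

    // Split off ?query/#fragment so dots inside them are left alone.
    preg_match('~^([^?#]*)(.*)$~s', $path, $m);
    [$path, $suffix] = [$m[1], $m[2]];

    // Collapse "." and ".." segments.
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue;
        }
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out); // never pop the leading empty segment (the root)
            }
            continue;
        }
        $out[] = $segment;
    }
    // A path ending in "." or ".." names a directory, so keep the trailing slash.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $out[] = '';
    }

    return $origin . implode('/', $out) . $suffix;
}

echo resolveUrl('http://www.yahoo.com/news/index.html', '../img/logo.gif'), "\n";
```

    In the loop above, resolveUrl("http://www.yahoo.com/", $match[2]) would be applied before the value is escaped and inserted, so the same page linked in relative and absolute form ends up under one url key.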
    
    nation-x, Apr 19, 2008
  6. ruby90

    #6
    Thanks again, nation-x, you're good :)

     
    ruby90, Apr 19, 2008
  7. nation-x

  8. Thapar

    Thapar Peon (Messages: 6, Likes Received: 0, Best Answers: 0, Trophy Points: 0)
    #8
    Sir, as part of my engineering project I have to design a tourism website that gets all its data by crawling various sites and maintains a database of it. There is no constraint on the language to be used, but I would prefer either PHP or Perl.
    Please suggest which of the two I should use (that is, which does the job quickly and is easy to pick up), and point me to some good resources where I can learn more.

    Regards -
    DT
     
    Thapar, Oct 18, 2009
  9. itrana123

    itrana123 Peon (Messages: 177, Likes Received: 2, Best Answers: 0, Trophy Points: 0)
    #9
    Thanks, nice work. It really helps :)
     
    itrana123, Apr 20, 2010