PHP Crawl From Website

Discussion in 'PHP' started by ruby90, Apr 19, 2008.

  1. #1
    Can anyone help me crawl this?

    <a href="http://www.example1.com">contents1</a><br />
    <a href="http://www.example2.com">contents2</a><br />
    <a href="http://www.example3.com">contents3</a><br />
    <a href="http://www.example4.com">contents4</a><br />
    <a href="http://www.example5.com">contents5</a><br />

    I want to insert these into a database, but only if the record does not already exist. Is that possible? Hoping for help from someone :)
     
    ruby90, Apr 19, 2008
  2. nation-x

    #2
    Just to clarify: you want to pull the links from a page and store them in MySQL?

    To fetch the page you could use an HTTP client class called Snoopy:
    http://sourceforge.net/projects/snoopy/

    There are examples on the web for grabbing the links from a page. Saving the data to MySQL is easy; the UNIQUE key on url in the table below stops the same link from being stored twice.

    
    <?php
    /* Database table SQL
    CREATE TABLE `links` (
      `id` int(11) NOT NULL auto_increment,
      `url` varchar(255) NOT NULL,
      `content` varchar(255) NOT NULL,
      PRIMARY KEY  (`id`),
      UNIQUE KEY `url` (`url`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
    */
    include("snoopy.class.php");
    
    $dbhost = 'localhost';
    $dbname = 'database_name';
    $dbuser = 'username';
    $dbpass = 'password';
    
    $link = mysql_connect($dbhost, $dbuser, $dbpass);
    if (!$link) {
        die('Could not connect: ' . mysql_error());
    }
    
    $db = mysql_select_db($dbname, $link);
    
    if (!$db) {
        die ('Can\'t use '.$dbname.' : ' . mysql_error());
    }
    
    $snoopy = new Snoopy;
    
    // need a proxy?:
    //$snoopy->proxy_host = "my.proxy.host";
    //$snoopy->proxy_port = "8080";
    
    // set browser and referer:
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
    $snoopy->referer = "http://www.google.com/";
    
    // set a raw header:
    $snoopy->rawheaders["Pragma"] = "no-cache";
    
    // set some internal variables:
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = true;
    
    // fetch the text of the website www.google.com:
    if($snoopy->fetchtext("http://www.google.com")){

        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        if(preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
            foreach($matches as $match) {
                # $match[2] = link address
                # $match[3] = link text
                // insert the link and text into the database. If the url already exists then basically do nothing
                $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                               mysql_real_escape_string($match[2]),
                               mysql_real_escape_string($match[3]),
                               mysql_real_escape_string($match[2]));
                $rs = mysql_query($sql);
            }
            echo "Operation has completed.";
        }
    }
    else {
        print "Snoopy: error while fetching document: ".$snoopy->error."\n";
    }
    ?>
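
The regular expression in the script above is one way to pull the href and link text out of a page. As a rough alternative sketch, assuming PHP's bundled DOM extension is available, the same pairs can be read with DOMDocument. The function name extract_links() is made up for this illustration and is not part of Snoopy.

```php
<?php
// Hypothetical helper (not part of Snoopy): returns a list of
// ['url' => ..., 'text' => ...] pairs for every <a href="..."> in $html.
function extract_links(string $html): array
{
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '') {
            continue; // anchors without an href are not links
        }
        $links[] = ['url' => $href, 'text' => trim($a->textContent)];
    }
    return $links;
}
```

The url and text values correspond to $match[2] and $match[3] in the script, so they could be fed to the same INSERT.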
    
     
    nation-x, Apr 19, 2008
  3. ruby90

    #3
    Thanks nation-x :) rep added

     
    ruby90, Apr 19, 2008
  4. nation-x

    #4
    You might have to make some changes to it. I didn't test it; I just modified this example I found on DZone to do what you wanted, so there might be errors. I don't debug free scripts :)

    http://snippets.dzone.com/posts/show/2007
     
    nation-x, Apr 19, 2008
  5. nation-x

    #5
    OK, so I decided to test it and it wasn't working, so I fixed it... a little. You will still need to modify the script to expand relative links correctly.

    
    <?php
    /* Database table SQL
    CREATE TABLE `links` (
      `id` int(11) NOT NULL auto_increment,
      `url` varchar(255) NOT NULL,
      `content` varchar(255) NOT NULL,
      PRIMARY KEY  (`id`),
      UNIQUE KEY `url` (`url`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
    */
    include("snoopy.class.php");
    
    $dbhost = 'localhost';
    $dbname = 'database_name';
    $dbuser = 'username';
    $dbpass = 'password';
    
    $link = mysql_connect($dbhost, $dbuser, $dbpass);
    if (!$link) {
        die('Could not connect: ' . mysql_error());
    }
    
    $db = mysql_select_db($dbname, $link);
    
    if (!$db) {
        die ('Can\'t use '.$dbname.' : ' . mysql_error());
    }
    
    $snoopy = new Snoopy;
    
    // need a proxy?:
    //$snoopy->proxy_host = "my.proxy.host";
    //$snoopy->proxy_port = "8080";
    
    // set browser and referer:
    $snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
    $snoopy->referer = "http://www.yahoo.com/";
    
    // set a raw header:
    $snoopy->rawheaders["Pragma"] = "no-cache";
    
    // set some internal variables:
    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = true;
    
    // fetch the raw HTML of www.yahoo.com (fetchtext() strips the tags, so the link regex would find nothing):
    if($snoopy->fetch("http://www.yahoo.com")){
           
            $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
            if(preg_match_all("/$regexp/siU", $snoopy->results, $matches, PREG_SET_ORDER)) {
                foreach($matches as $match) {
                    # $match[2] = link address
                    # $match[3] = link text
                    // insert the link and text into the database. If the url already exists then basically do nothing
                    $sql = sprintf("INSERT INTO links (url,content) VALUES ('%s','%s') ON DUPLICATE KEY UPDATE url='%s'",
                                         mysql_real_escape_string($match[2]),
                                         mysql_real_escape_string($match[3]),
                                         mysql_real_escape_string($match[2]));
                    $rs = mysql_query($sql);
                }
                echo "Operation has completed.";
            }
    }
    else {
        print "Snoopy: error while fetching document: ".$snoopy->error."\n";
    }
    ?>
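
The script above stores each href exactly as it appears in the page, so relative links still need expanding. A minimal sketch of one way to do that, assuming the page URL is known; expand_link() is a made-up name for this illustration, not a Snoopy method, and it only covers the common cases rather than every rule in RFC 3986.

```php
<?php
// Hypothetical helper, not part of Snoopy: turn $href into an absolute URL
// using the page URL $base. Simplified; it does not cover every RFC 3986 case.
function expand_link(string $base, string $href): string
{
    // Already absolute (http:, https:, mailto:, ...): leave untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }

    $b        = parse_url($base);
    $scheme   = $b['scheme'] ?? 'http';
    $root     = $scheme . '://' . ($b['host'] ?? '')
              . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Protocol-relative link such as //cdn.example.com/x.js
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;
    }

    // Split the link into its path and any ?query / #fragment suffix.
    $cut    = strcspn($href, '?#');
    $path   = substr($href, 0, $cut);
    $suffix = substr($href, $cut);

    // Query-only or fragment-only link: keep the page's own path.
    if ($path === '') {
        $keepQuery = isset($b['query']) && substr($suffix, 0, 1) !== '?';
        return $root . $basePath . ($keepQuery ? '?' . $b['query'] : '') . $suffix;
    }

    // Relative path: prepend the directory of the base page.
    if ($path[0] !== '/') {
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Collapse "." and ".." segments, never climbing above the root.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $root . implode('/', $out) . $suffix;
}
```

In the script this could wrap $match[2] before the INSERT, for example expand_link("http://www.yahoo.com/", $match[2]).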
    
     
    nation-x, Apr 19, 2008
  6. ruby90

    #6
    Thanks again nation-x you're good :)

     
    ruby90, Apr 19, 2008
  8. Thapar

    #8
    Sir, as part of my engineering project I have to design a tourism website that gets all its data by crawling various sites and maintains a database of it. There is no constraint on the language, but I would prefer either PHP or Perl.
    Please suggest which of the two I should use (that is, which does the job more quickly and is easier to pick up), and point me to some good resources on the topic.

    Regards -
    DT
     
    Thapar, Oct 18, 2009
  9. itrana123

    #9
    Thanks, nice work. It really helps :)
     
    itrana123, Apr 20, 2010