
Creating a Googlebot

Discussion in 'Programming' started by Kasparoff, May 31, 2007.

  1. #1
    Hi,

    Please inspire me on this subject. I want to create my own website download robot similar to GoogleBot.

    Can someone point me in the right direction?

    Thanks!
     
    Kasparoff, May 31, 2007 IP
  2. Spartan_Strategy

    Spartan_Strategy Peon

    #2
    Check out ASP Tear; it's something that might help. It allows you to visit a site and pull its content. Obviously, you would then need to parse it.

    Your first step is probably to decide what data you will gather with your bot and the DB model that will store it. Once your bot is going, how deep will it index? Will it then go to external links on that same site, and then on and on to the next?
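
    Just to make that concrete, here is a rough, untested sketch of that kind of crawl loop (in PHP purely for illustration; the depth limit and the naive link regex are placeholders, and you would swap the echo for writes into whatever DB model you settle on):

    // Rough sketch only: breadth-first crawl with a depth limit.
    // A real bot would also need robots.txt handling, politeness delays, etc.
    function crawl($seedUrl, $maxDepth = 1)
    {
        $queue = array(array('url' => $seedUrl, 'depth' => 0));
        $seen  = array($seedUrl => true);

        while (!empty($queue))
        {
            $item = array_shift($queue);
            $html = @file_get_contents($item['url']);
            if (!is_string($html))
                continue;

            // This is where you'd store the page according to your DB model.
            echo "Fetched " . $item['url'] . " (depth " . $item['depth'] . ")\n";

            if ($item['depth'] >= $maxDepth)
                continue;

            // Naive absolute-link extraction; relative URLs are ignored here.
            preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $m);
            foreach ($m[1] as $link)
            {
                if (!isset($seen[$link]))
                {
                    $seen[$link] = true;
                    $queue[] = array('url' => $link, 'depth' => $item['depth'] + 1);
                }
            }
        }
    }

    crawl('http://www.example.com/', 1);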
     
    Spartan_Strategy, May 31, 2007 IP
    Kasparoff likes this.
  3. Freewebspace

    Freewebspace Notable Member

    #3
    I have a bot that is currently indexing at a rate of about 200,000 URLs per day (it can index much more).

    It only indexes meta tags.

    Do you want it?

    Meanwhile, to create a Google-style bot you need a lot of resources, among other things!
     
    Freewebspace, May 31, 2007 IP
  4. The Stealthy One

    The Stealthy One Well-Known Member Affiliate Manager

    #4
    Kasparoff, what's your reason for doing this?
     
    The Stealthy One, May 31, 2007 IP
  5. Aztral

    Aztral Well-Known Member

    #5
    My preference is VC++/MFC for speed. I removed error checking in the code below just to give you an example.

    #include <afxinet.h>  // MFC WinInet classes (CInternetSession, CInternetFile)

    // This function returns what you'd see in Dreamweaver "code view".
    // Parse this text for the various tags you want to look for.
    CString GetHTML(CString szURL)
    {
        CString szOutput;
        CInternetSession session;
        CInternetFile* file = (CInternetFile*) session.OpenURL(szURL);

        if (file)
        {
            CString szFileText;
            while (file->ReadString(szFileText))   // read the page line by line
                szOutput += szFileText;

            file->Close();
            delete file;
        }

        return szOutput;
    }
     
    Aztral, May 31, 2007 IP
    Kasparoff likes this.
  6. Freewebspace

    Freewebspace Notable Member

    #6

    What about using PHP?
     
    Freewebspace, May 31, 2007 IP
  7. Aztral

    Aztral Well-Known Member

    #7
    Nothing "wrong" with using PHP... whatever works. :)

    Just saying that I "prefer" C++ for this kind of stuff. While it probably takes about the same amount of time to make the connection and download the file, I'm sure that processing/parsing the file for tags and so on would be much faster in a compiled language like C++ (not a script). :D
     
    Aztral, May 31, 2007 IP
  8. Kasparoff

    Kasparoff Peon

    #8
    Thanks guys for your answers.

    Is there any book on this subject that provides in-depth details?

    Basically, I want to create my own small search engine.
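
    From what I've read so far, the heart of the "engine" part is an inverted index: a map from each word to the documents that contain it. A toy PHP sketch with two made-up documents, just to show the idea:

    // Toy inverted index: word => list of document ids containing it.
    // The two "documents" here are made up purely for illustration.
    $documents = array(
        1 => 'How to build a web crawler in PHP',
        2 => 'A crawler downloads pages and a search engine indexes them',
    );

    $index = array();
    foreach ($documents as $id => $text)
    {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word)
        {
            $index[$word][] = $id;
        }
    }

    // Documents containing the word "crawler" => both of them
    print_r($index['crawler']);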
     
    Kasparoff, Jun 1, 2007 IP
  9. syedwasi87

    syedwasi87 Active Member

    #9
    Is there any particular advantage to creating our own bots?
     
    syedwasi87, Jun 1, 2007 IP
  10. Smaaz

    Smaaz Notable Member

    #10
    Here is a simple self-made robot:
    
    function getUrlData($url)
    {
        $result = false;

        $contents = getUrlContents($url);

        if (isset($contents) && is_string($contents))
        {
            $title = null;
            $metaTags = null;
            $sitecontent = '';

            // Grab the <title> tag
            preg_match('/<title>([^>]*)<\/title>/si', $contents, $match);

            if (isset($match) && is_array($match) && count($match) > 0)
            {
                $title = strip_tags($match[1]);
            }

            // Grab all <meta name="..." content="..."> tags
            $count = preg_match_all('/<[\s]*meta[\s]*name="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

            if (isset($match) && is_array($match) && count($match) == 3)
            {
                $originals = $match[0];
                $names = $match[1];
                $values = $match[2];

                if (count($originals) == count($names) && count($names) == count($values))
                {
                    $metaTags = array();

                    for ($i = 0, $limiti = count($names); $i < $limiti; $i++)
                    {
                        $metaTags[strtolower($names[$i])] = array(
                            'html' => htmlentities($originals[$i]),
                            'value' => $values[$i]
                        );
                    }
                }
            }

            // No meta description? Fall back to the first ~280 characters of the body text
            if (!isset($metaTags['description']) || trim($metaTags['description']['value']) == "")
            {
                if (preg_match_all('/<BODY(.*)>(.*)<\/BODY>/isU', $contents, $odo) > 0)
                {
                    $sitecontent = preg_replace('/<script type=(.*)<\/script>/isU', '', $odo[2][0]);
                    $sitecontent = preg_replace('/<style type=(.*)<\/style>/isU', '', $sitecontent);
                    $sitecontent = strip_tags($sitecontent);
                    $sitecontent = str_replace("\n", " ", $sitecontent);
                    $sitecontent = str_replace("\t", " ", $sitecontent);
                    $sitecontent = preg_replace('/ +/', ' ', $sitecontent);
                    $sitecontent = trim(substr($sitecontent, 0, 280));
                }
            }

            if (!$title)
            {
                $title = $url;
            }

            $result = array(
                'title' => $title,
                'metaTags' => $metaTags,
                'sitecontent' => $sitecontent
            );
        }

        return $result;
    }

    function getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0)
    {
        $result = false;

        $contents = @file_get_contents($url);

        // Check if we need to go somewhere else (meta refresh redirect)
        if (isset($contents) && is_string($contents))
        {
            preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' . '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' . '[\s]*[\/]?[\s]*>/si', $contents, $match);

            if (isset($match) && is_array($match) && count($match) == 2 && count($match[1]) == 1)
            {
                if (!isset($maximumRedirections) || $currentRedirection < $maximumRedirections)
                {
                    return getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
                }

                $result = false;
            }
            else
            {
                $result = $contents;
            }
        }

        return $result;
    }

    $url = 'http://www.example.com/';  // example URL
    $result = getUrlData($url);

    echo $result['metaTags']['description']['value'];
    echo $result['metaTags']['keywords']['value'];
    echo $result['sitecontent'];
    
    PHP:
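
    And just to show how you might actually use the two functions above in a loop (untested sketch; the seed URLs and the CSV output are only examples, and a real bot would also pull new links out of each page it fetches):

    $seeds = array(
        'http://www.example.com/',
        'http://www.example.org/',
    );

    $out = fopen('crawl.csv', 'w');   // dump results somewhere simple for now

    foreach ($seeds as $url)
    {
        $data = getUrlData($url);     // uses the functions from the code above
        if (!is_array($data))
            continue;

        // Prefer the meta description, fall back to the extracted body text
        $description = isset($data['metaTags']['description']['value'])
            ? $data['metaTags']['description']['value']
            : $data['sitecontent'];

        fputcsv($out, array($url, $data['title'], $description));
    }

    fclose($out);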
     
    Smaaz, Jun 1, 2007 IP
  11. zonzon

    zonzon Peon

    #11
    zonzon, Jun 1, 2007 IP
  12. Spartan_Strategy

    Spartan_Strategy Peon

    #12
    Spartan_Strategy, Jun 2, 2007 IP
  13. Alis

    Alis Peon

    #13
    Well, I was interested in a bot to get information like the databases of a website, but later on I dropped it.
     
    Alis, Jun 2, 2007 IP
  14. smalldog

    smalldog Peon

    #14
    A long time ago I developed my own spider software in C++ too. I named it GreenPenguin, and below is a picture :). But GP was never finished, because there are two big problems: 1) the traffic limit of your internet service provider, and 2) how to store and work with the crawled data. Just imagine how you would store, for example, 1 TB (1,024 GB = 1,048,576 MB) of crawled websites, and how you would manipulate that data (I mean refreshing stale data, etc.). The only solution, it seems, is to buy some storage unit like here http://www-03.ibm.com/systems/storage/index.html and that's not cheap ;).

    [screenshot of GreenPenguin]
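
    Just to put rough numbers on that second problem (the 50 KB average page size is an assumption, and the 200,000 URLs/day rate is the one mentioned earlier in the thread):

    // Back-of-the-envelope storage estimate; both inputs are assumptions.
    $avgPageBytes = 50 * 1024;   // assume roughly 50 KB of HTML per page
    $urlsPerDay   = 200000;      // crawl rate mentioned earlier in the thread

    $bytesPerDay = $avgPageBytes * $urlsPerDay;
    $daysPerTB   = (1024 * 1024 * 1024 * 1024) / $bytesPerDay;

    printf("~%.1f GB per day, so about %.0f days to fill 1 TB\n",
           $bytesPerDay / (1024 * 1024 * 1024), $daysPerTB);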
     
    smalldog, Jun 3, 2007 IP
  15. Alis

    Alis Peon

    #15
    smalldog, could you assist me with creating a bot that gets some specific information, not a website cache but a specified database cache?
     
    Alis, Jun 3, 2007 IP