Hi, please inspire me on this subject. I want to create my own website download robot, similar to GoogleBot. Can someone point me in the right direction? Thanks!
Check out ASP Tear; it's something that might help. It allows you to visit a site and pull its content, but then you'll need to parse it yourself. Your first step is probably to decide what data your bot will gather and design the DB model that will store it. Once your bot is going, how deep will it index? Will it then follow external links on that same site, and then on and on to the next?
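To make the "how deep will it index" question concrete, here is a minimal sketch of a breadth-first crawl frontier with a depth limit and a visited set. The `fetchLinks` callback and the tiny `fakeLinks` graph are stand-ins I made up for illustration; a real bot would download and parse each page there:

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Breadth-first crawl: visit each URL once, stop expanding at maxDepth.
// fetchLinks is a stand-in for the real "download page, extract links" step.
std::vector<std::string> crawl(
    const std::string& seed,
    int maxDepth,
    std::vector<std::string> (*fetchLinks)(const std::string&))
{
    std::vector<std::string> indexed;
    std::set<std::string> visited{seed};
    std::queue<std::pair<std::string, int>> frontier;
    frontier.push({seed, 0});

    while (!frontier.empty()) {
        auto [url, depth] = frontier.front();
        frontier.pop();
        indexed.push_back(url);           // here you would store title/meta tags
        if (depth == maxDepth) continue;  // do not expand past the depth limit
        for (const std::string& link : fetchLinks(url)) {
            if (visited.insert(link).second)  // skip already-queued URLs
                frontier.push({link, depth + 1});
        }
    }
    return indexed;
}

// Tiny in-memory "web" used only to exercise the loop (an assumption,
// not real networking).
std::vector<std::string> fakeLinks(const std::string& url) {
    if (url == "a") return {"b", "c"};
    if (url == "b") return {"c", "d"};
    return {};
}
```

Whether to follow external links then becomes a one-line filter on `link` before pushing it onto the frontier.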
I have a bot that is currently indexing at a rate of about 200,000 URLs per day (it can index much more). It only indexes meta tags - do you want it? In any case, to create a Google-style bot you need a lot of resources, among other things!
My preference is VC++/MFC for speed. I removed error checking in the code below just to give you an example.

// This function returns what you'd see in Dreamweaver "code view".
// Parse this text for the various tags you want to look for.
CString GetHTML(CString szURL)
{
    CString szOutput;
    CInternetSession session;
    CInternetFile* file = (CInternetFile*) session.OpenURL(szURL);
    if (file)
    {
        CString szFileText;
        while (file->ReadString(szFileText))
            szOutput += szFileText;
        file->Close();
        delete file;
    }
    return szOutput;
}
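If you'd rather avoid MFC, the tag-parsing step can be sketched in portable C++. `extractTitle` below is a hypothetical helper, not part of any library, and plain string scanning like this breaks on malformed HTML - a serious crawler should use a real HTML parser:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Case-insensitive search for <title>...</title>, returning the text between
// the tags with its original casing. Returns "" if no title is found.
std::string extractTitle(const std::string& html) {
    std::string lower = html;
    for (char& c : lower) c = (char)std::tolower((unsigned char)c);
    std::size_t open = lower.find("<title>");
    if (open == std::string::npos) return "";
    open += 7;  // skip past "<title>"
    std::size_t close = lower.find("</title>", open);
    if (close == std::string::npos) return "";
    return html.substr(open, close - open);
}
```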
Nothing "wrong" with using php...whatever works. Just saying that I "prefer" C++ for this kinda stuff. While it probably takes about the same amount of time to make the connection and download the file, I'm 100% sure that processing/parsing the file for tags et al would be much faster using a compiled language like C++ (not a script).
Thanks guys for your answers. Is there any book written on this subject which can provide details in depth? Basically, I want to create my own small search engine.
Here is a simple self-made robot (cleaned up a bit; note it relies on file_get_contents, so allow_url_fopen must be enabled):

function getUrlData($url)
{
    $result = false;
    $contents = getUrlContents($url);
    if (isset($contents) && is_string($contents)) {
        $title = null;
        $metaTags = null;
        $sitecontent = '';
        preg_match('/<title>([^>]*)<\/title>/si', $contents, $match);
        if (isset($match) && is_array($match) && count($match) > 0) {
            $title = strip_tags($match[1]);
        }
        $count = preg_match_all('/<[\s]*meta[\s]*name="?' .
            '([^>"]*)"?[\s]*' .
            'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si',
            $contents, $match);
        if (isset($match) && is_array($match) && count($match) == 3) {
            $originals = $match[0];
            $names     = $match[1];
            $values    = $match[2];
            if (count($originals) == count($names) && count($names) == count($values)) {
                $metaTags = array();
                for ($i = 0, $limiti = count($names); $i < $limiti; $i++) {
                    $metaTags[strtolower($names[$i])] = array(
                        'html'  => htmlentities($originals[$i]),
                        'value' => $values[$i]
                    );
                }
            }
        }
        // No usable meta description? Fall back to the first 280 chars of body text.
        if (!isset($metaTags['description']) || trim($metaTags['description']['value']) == "") {
            preg_match_all('/<BODY(.*)>(.*)<\/BODY>/isU', $contents, $odo);
            if (isset($odo[2][0])) {
                $sitecontent = preg_replace('/<script type=(.*)<\/script>/isU', '', $odo[2][0]);
                $sitecontent = preg_replace('/<style type=(.*)<\/style>/isU', '', $sitecontent);
                $sitecontent = strip_tags($sitecontent);
                $sitecontent = str_replace(array("\n", "\t"), " ", $sitecontent);
                $sitecontent = preg_replace('/ +/', ' ', $sitecontent); // ereg_replace is deprecated
                $sitecontent = trim(substr($sitecontent, 0, 280));
            }
        }
        if (!$title) {
            $title = $url;
        }
        $result = array(
            'title'       => $title,
            'metaTags'    => $metaTags,
            'sitecontent' => $sitecontent
        );
    }
    return $result;
}

function getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0)
{
    $result = false;
    $contents = @file_get_contents($url);
    // Follow <meta http-equiv="refresh"> redirects, up to $maximumRedirections deep
    if (isset($contents) && is_string($contents)) {
        preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' .
            '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' .
            '[\s]*[\/]?[\s]*>/si', $contents, $match);
        if (isset($match) && is_array($match) && count($match) == 2 && count($match[1]) == 1) {
            if (!isset($maximumRedirections) || $currentRedirection < $maximumRedirections) {
                return getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
            }
            $result = false; // redirect limit reached
        } else {
            $result = $contents;
        }
    }
    return $result; // was "return $contents", which defeated the redirect limit
}

$result = getUrlData($url);
echo $result['metaTags']['description']['value'];
echo $result['metaTags']['keywords']['value'];
echo $result['sitecontent'];
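For comparison, the same meta-tag scan can be sketched in C++ with std::regex. Like the PHP regex above, this sketch assumes the name attribute comes before content and that both values are double-quoted; it is an illustration, not a robust HTML parser:

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <regex>
#include <string>

// Collect name -> content pairs from <meta name="..." content="..."> tags.
// Keys are lowercased, mirroring the strtolower() call in the PHP version.
std::map<std::string, std::string> metaTags(const std::string& html) {
    std::map<std::string, std::string> tags;
    std::regex re("<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"[^>]*>",
                  std::regex::icase);
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it) {
        std::string name = (*it)[1].str();
        for (char& c : name) c = (char)std::tolower((unsigned char)c);
        tags[name] = (*it)[2].str();
    }
    return tags;
}
```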
You'll need a lot of space if you store content... There are a lot of open-source web crawlers. I suggest you take a look at this link: http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers I never tried any of them... good luck
This paper gives great insight into how the Google bot actually works: The Anatomy of a Large-Scale Hypertextual Web Search Engine http://infolab.stanford.edu/pub/papers/google.pdf
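The core ranking idea in that paper is PageRank. Below is a minimal power-iteration sketch, with the damping factor 0.85 used in the paper; pages with no out-links are simply ignored here for brevity, which a real implementation would need to handle:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal PageRank power iteration. links[i] lists the pages that page i
// links to; each page shares d * rank evenly over its out-links, and every
// page receives a (1 - d) / n "teleport" share per iteration.
std::vector<double> pageRank(const std::vector<std::vector<int>>& links,
                             int iterations = 50, double d = 0.85) {
    int n = (int)links.size();
    std::vector<double> rank(n, 1.0 / n);
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next(n, (1.0 - d) / n);
        for (int i = 0; i < n; ++i)
            for (int j : links[i])
                next[j] += d * rank[i] / links[i].size();
        rank = next;
    }
    return rank;
}
```

In a graph where pages 1 and 2 both link to page 0, page 0 ends up with the highest rank, which is exactly the "important pages are linked to by many pages" intuition.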
Well, I was interested in a bot to get information like the databases of a website, but later on I dropped it.
A long time ago I developed my own spider software in C++ too. I named it GreenPenguin, and below is a picture. But GP was never finished, because there are two big problems: 1) the traffic limit of your internet service provider, and 2) how to store and operate on the crawled data. Just imagine how you would store, for example, 1 TB (1024 GB = 1,048,576 MB) of crawled websites - how do you manipulate that data, refresh stale entries, and so on? The only solution seems to be buying a storage unit like the ones here http://www-03.ibm.com/systems/storage/index.html and that's not cheap.
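On the storage problem: one common trick is to shard crawled pages across directories by URL hash, so no single directory grows huge and any page can be located again without scanning an index. The two-level `store/xx/yy/` layout below is just an assumption for illustration, not a standard:

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>

// Map a URL to a deterministic on-disk path: two hash-derived directory
// levels (256 x 256 buckets), then the full hash as the file name.
std::string shardPath(const std::string& url) {
    std::size_t h = std::hash<std::string>{}(url);
    char buf[64];
    std::snprintf(buf, sizeof buf, "store/%02zx/%02zx/%016zx.html",
                  (h >> 8) & 0xff, h & 0xff, h);
    return std::string(buf);
}
```

The same hash can double as the key in the metadata DB row for that page, which keeps the "refresh stale entries" bookkeeping cheap.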
smalldog, could you assist me with creating a bot that gets some specified information - not a website cache, but a database cache?