Hi, please inspire me on this subject. I want to create my own website download robot, similar to GoogleBot. Can someone point me in the right direction? Thanks!
Check out ASP Tear; it's something that might help. It allows you to visit a site and pull its content, but then you'll need to parse it yourself. Your first step is probably to decide what data your bot will gather and design the DB model that will store it. Once your bot is going, how deep will it index? Will it then follow external links on that same site, and then on and on to the next?
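To make the "how deep will it index" question concrete, here is a minimal sketch of a breadth-first crawl frontier with a depth limit and a visited set. The `fetchLinks` callback and the tiny `fakeLinks` graph are stand-ins I made up for illustration; a real bot would download and parse each page there:

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Breadth-first crawl: visit each URL once, stop expanding at maxDepth.
// fetchLinks is a stand-in for the real "download page, extract links" step.
std::vector<std::string> crawl(
    const std::string& seed,
    int maxDepth,
    std::vector<std::string> (*fetchLinks)(const std::string&))
{
    std::vector<std::string> indexed;
    std::set<std::string> visited{seed};
    std::queue<std::pair<std::string, int>> frontier;
    frontier.push({seed, 0});

    while (!frontier.empty()) {
        auto [url, depth] = frontier.front();
        frontier.pop();
        indexed.push_back(url);           // here you would store title/meta tags
        if (depth == maxDepth) continue;  // do not expand past the depth limit
        for (const std::string& link : fetchLinks(url)) {
            if (visited.insert(link).second)  // skip already-queued URLs
                frontier.push({link, depth + 1});
        }
    }
    return indexed;
}

// Tiny in-memory "web" used only to exercise the loop (an assumption,
// not real networking).
std::vector<std::string> fakeLinks(const std::string& url) {
    if (url == "a") return {"b", "c"};
    if (url == "b") return {"c", "d"};
    return {};
}
```

Whether to follow external links then becomes a one-line filter on `link` before pushing it onto the frontier.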
I have a bot that is currently indexing at a rate of about 200,000 URLs per day (it can index much more). It only indexes meta tags - do you want it? In any case, to create a Google-style bot you need a lot of resources, among other things!
My preference is VC++/MFC for speed. I removed error checking in the code below just to give you an example.

// This function returns what you'd see in Dreamweaver "code view".
// Parse this text for the various tags you want to look for.
CString GetHTML(CString szURL)
{
    CString szOutput;
    CInternetSession session;
    CInternetFile* file = (CInternetFile*) session.OpenURL(szURL);
    if (file)
    {
        CString szFileText;
        while (file->ReadString(szFileText))
            szOutput += szFileText;
        file->Close();
        delete file;
    }
    return szOutput;
}
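If you'd rather avoid MFC, the tag-parsing step can be sketched in portable C++. `extractTitle` below is a hypothetical helper, not part of any library, and plain string scanning like this breaks on malformed HTML - a serious crawler should use a real HTML parser:

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Case-insensitive search for <title>...</title>, returning the text between
// the tags with its original casing. Returns "" if no title is found.
std::string extractTitle(const std::string& html) {
    std::string lower = html;
    for (char& c : lower) c = (char)std::tolower((unsigned char)c);
    std::size_t open = lower.find("<title>");
    if (open == std::string::npos) return "";
    open += 7;  // skip past "<title>"
    std::size_t close = lower.find("</title>", open);
    if (close == std::string::npos) return "";
    return html.substr(open, close - open);
}
```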
Nothing "wrong" with using php...whatever works. Just saying that I "prefer" C++ for this kinda stuff. While it probably takes about the same amount of time to make the connection and download the file, I'm 100% sure that processing/parsing the file for tags et al would be much faster using a compiled language like C++ (not a script).
Thanks guys for your answers. Is there any book written on this subject which can provide details in depth? Basically, I want to create my own small search engine.
Here is a simple self-made robot (cleaned up a bit; note it relies on file_get_contents, so allow_url_fopen must be enabled):

function getUrlData($url)
{
    $result = false;
    $contents = getUrlContents($url);
    if (isset($contents) && is_string($contents)) {
        $title = null;
        $metaTags = null;
        $sitecontent = '';
        preg_match('/<title>([^>]*)<\/title>/si', $contents, $match);
        if (isset($match) && is_array($match) && count($match) > 0) {
            $title = strip_tags($match[1]);
        }
        $count = preg_match_all('/<[\s]*meta[\s]*name="?' .
            '([^>"]*)"?[\s]*' .
            'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si',
            $contents, $match);
        if (isset($match) && is_array($match) && count($match) == 3) {
            $originals = $match[0];
            $names     = $match[1];
            $values    = $match[2];
            if (count($originals) == count($names) && count($names) == count($values)) {
                $metaTags = array();
                for ($i = 0, $limiti = count($names); $i < $limiti; $i++) {
                    $metaTags[strtolower($names[$i])] = array(
                        'html'  => htmlentities($originals[$i]),
                        'value' => $values[$i]
                    );
                }
            }
        }
        // No usable meta description? Fall back to the first 280 chars of body text.
        if (!isset($metaTags['description']) || trim($metaTags['description']['value']) == "") {
            preg_match_all('/<BODY(.*)>(.*)<\/BODY>/isU', $contents, $odo);
            if (isset($odo[2][0])) {
                $sitecontent = preg_replace('/<script type=(.*)<\/script>/isU', '', $odo[2][0]);
                $sitecontent = preg_replace('/<style type=(.*)<\/style>/isU', '', $sitecontent);
                $sitecontent = strip_tags($sitecontent);
                $sitecontent = str_replace(array("\n", "\t"), " ", $sitecontent);
                $sitecontent = preg_replace('/ +/', ' ', $sitecontent); // ereg_replace is deprecated
                $sitecontent = trim(substr($sitecontent, 0, 280));
            }
        }
        if (!$title) {
            $title = $url;
        }
        $result = array(
            'title'       => $title,
            'metaTags'    => $metaTags,
            'sitecontent' => $sitecontent
        );
    }
    return $result;
}

function getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0)
{
    $result = false;
    $contents = @file_get_contents($url);
    // Follow <meta http-equiv="refresh"> redirects, up to $maximumRedirections deep
    if (isset($contents) && is_string($contents)) {
        preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' .
            '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' .
            '[\s]*[\/]?[\s]*>/si', $contents, $match);
        if (isset($match) && is_array($match) && count($match) == 2 && count($match[1]) == 1) {
            if (!isset($maximumRedirections) || $currentRedirection < $maximumRedirections) {
                return getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
            }
            $result = false; // redirect limit reached
        } else {
            $result = $contents;
        }
    }
    return $result; // was "return $contents", which defeated the redirect limit
}

$result = getUrlData($url);
echo $result['metaTags']['description']['value'];
echo $result['metaTags']['keywords']['value'];
echo $result['sitecontent'];
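For comparison, the same meta-tag scan can be sketched in C++ with std::regex. Like the PHP regex above, this sketch assumes the name attribute comes before content and that both values are double-quoted; it is an illustration, not a robust HTML parser:

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <regex>
#include <string>

// Collect name -> content pairs from <meta name="..." content="..."> tags.
// Keys are lowercased, mirroring the strtolower() call in the PHP version.
std::map<std::string, std::string> metaTags(const std::string& html) {
    std::map<std::string, std::string> tags;
    std::regex re("<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"[^>]*>",
                  std::regex::icase);
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it) {
        std::string name = (*it)[1].str();
        for (char& c : name) c = (char)std::tolower((unsigned char)c);
        tags[name] = (*it)[2].str();
    }
    return tags;
}
```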
You'll need a lot of space if you store content... There are a lot of open-source web crawlers. I suggest you take a look at this link: http://en.wikipedia.org/wiki/Web_crawler#Open-source_crawlers I never tried any of them... good luck
This paper gives great insight into how the Google bot actually works: The Anatomy of a Large-Scale Hypertextual Web Search Engine http://infolab.stanford.edu/pub/papers/google.pdf
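The core ranking idea in that paper is PageRank. Below is a minimal power-iteration sketch, with the damping factor 0.85 used in the paper; pages with no out-links are simply ignored here for brevity, which a real implementation would need to handle:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal PageRank power iteration. links[i] lists the pages that page i
// links to; each page shares d * rank evenly over its out-links, and every
// page receives a (1 - d) / n "teleport" share per iteration.
std::vector<double> pageRank(const std::vector<std::vector<int>>& links,
                             int iterations = 50, double d = 0.85) {
    int n = (int)links.size();
    std::vector<double> rank(n, 1.0 / n);
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next(n, (1.0 - d) / n);
        for (int i = 0; i < n; ++i)
            for (int j : links[i])
                next[j] += d * rank[i] / links[i].size();
        rank = next;
    }
    return rank;
}
```

In a graph where pages 1 and 2 both link to page 0, page 0 ends up with the highest rank, which is exactly the "important pages are linked to by many pages" intuition.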
Well, I was interested in a bot to get information like the databases of a website, but later on I dropped it.
A long time ago I developed my own spider software in C++ too. I named it GreenPenguin, and below is a picture. But GP was never finished, because there are two big problems: 1) the traffic limit of your internet service provider, and 2) how to store and operate on the crawled data. Just imagine how you would store, for example, 1 TB (1024 GB = 1,048,576 MB) of crawled websites - how do you manipulate that data, refresh stale entries, and so on? The only solution seems to be buying a storage unit like the ones here http://www-03.ibm.com/systems/storage/index.html and that's not cheap.
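On the storage problem: one common trick is to shard crawled pages across directories by URL hash, so no single directory grows huge and any page can be located again without scanning an index. The two-level `store/xx/yy/` layout below is just an assumption for illustration, not a standard:

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>

// Map a URL to a deterministic on-disk path: two hash-derived directory
// levels (256 x 256 buckets), then the full hash as the file name.
std::string shardPath(const std::string& url) {
    std::size_t h = std::hash<std::string>{}(url);
    char buf[64];
    std::snprintf(buf, sizeof buf, "store/%02zx/%02zx/%016zx.html",
                  (h >> 8) & 0xff, h & 0xff, h);
    return std::string(buf);
}
```

The same hash can double as the key in the metadata DB row for that page, which keeps the "refresh stale entries" bookkeeping cheap.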
smalldog, could you assist me with creating a bot that gets some specified information - not a website cache, but a database cache?