Ok, following some recent anger I had at posts from search engine creators (well, I say creators, I really mean script users who claim they're the next Google killer), I had a PM from a Digitalpoint user telling me to put up or shut up. I decided I would do the former, and have since started working on a full PHP crawler-based search engine with a 100% unique backend. Since I don't intend to rival Google (I've got about 2 bucks to my name), I decided I would help others along the way and show some principal aspects of creating a search engine. I hope when I launch mine it will at least be useful to some, as it has one unique feature that no other search engine does: instant indexing. This means as soon as a site posts something, it would be on AstraWeb.

See AstraWeb's Crawler in action

The crawler is running right now gathering data for the 1st of June launch, so you can view what sites are being crawled. If you want to know how to index your site, just post here (not sure if anyone wants to). The crawler is named 'Jack Astra', so if you see this in your stats it means we've indexed you.

http://astrasearch.org/jack.php

The crawler

A search engine is pretty easy to code; the crawler tends to be the hard part. I've spent the last month working out different ways of doing it with limited server resources. At first I tried free open source engines such as Nutch, but I found them limiting and slow with my given resources, so then I tried to code a simple PHP crawler. The biggest problem I had was getting the correct description for each site.

If you're thinking about creating a search engine, I would advise doing a two-level system. First, check whether the meta tags on the website you're crawling are unique per page or are bogus meta tags (all the same). Then check the site content against the meta tags to make sure it's not a spam site, then insert into the database. The problem sites, however, are those without meta tags. I went through a range of ways of getting a description, and so far the best option I came up with uses about 2000 lines to find the best place to start and stop text extraction. It still is not perfect.

Crawler Basic Functions

Start from input sources > extract all links from the input and follow the first while other threads follow the others > apply your algorithm to each link > insert into some sort of database. A rough sketch of this loop is below.
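Something like this is the general shape (a minimal single-threaded sketch, not my production code; the seed URL, the 50-page limit, the 160-character snippet cap and the user-agent string are all placeholders I've made up here):

```php
<?php
// Minimal single-threaded sketch of the loop above plus the two-level
// meta check. Placeholder values: seed URL, 50-page limit, 160-char
// snippet cap, user-agent string.

$queue           = ['http://example.com/']; // input (seed) sources
$seen            = [];                      // URLs already crawled
$metaSeenPerHost = [];                      // descriptions seen per host

while ($queue && count($seen) < 50) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    // Fetch the page with cURL.
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_USERAGENT      => 'JackAstra-sketch/0.1',
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    if (!is_string($html) || $html === '') {
        continue;
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ because real-world HTML is rarely valid

    // Level 1: grab the meta description, if there is one.
    $meta = '';
    foreach ($doc->getElementsByTagName('meta') as $m) {
        if (strtolower($m->getAttribute('name')) === 'description') {
            $meta = trim($m->getAttribute('content'));
        }
    }

    // Level 2: treat a description already seen on this host as bogus
    // (site-wide boilerplate) and fall back to the page body instead.
    $host = (string) parse_url($url, PHP_URL_HOST);
    $prev = isset($metaSeenPerHost[$host]) ? $metaSeenPerHost[$host] : [];
    if ($meta === '' || in_array($meta, $prev, true)) {
        // Crude fallback: a real version strips <script>/<style> first.
        $body = $doc->getElementsByTagName('body')->item(0);
        $text = $body ? preg_replace('/\s+/', ' ', $body->textContent) : '';
        $description = substr(trim($text), 0, 160);
    } else {
        $description = $meta;
    }
    $metaSeenPerHost[$host][] = $meta;

    // "Insert into some sort of database" -- stubbed out with a print.
    printf("%s => %s\n", $url, $description);

    // Extract all links and queue them (absolute URLs only here; a real
    // crawler would resolve relative links and check robots.txt).
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^https?://#i', $href)) {
            $queue[] = $href;
        }
    }
}
```

The bogus-meta check here is deliberately naive (it only flags exact duplicate descriptions per host); the two-level idea described above would also compare the meta against the actual page content before trusting it, and the multi-threaded version would hand the extracted links off to other workers instead of queueing them locally.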
The world's easiest search engines to build are meta engines: there's no hard work at all, you just collect information from the title tag and the meta description tag, that's all. But if your search engine is a meta engine, you may face problems when you index multiple pages on the same domain. Collecting information from only the main page is no problem; indexing more than one page on the same domain with a meta engine is. I have found lots of sites using the same meta description on every page, so at indexing time there's a good chance of collecting the same description from every page. Suppose you index www.mydomain.com and collect a total of 8 pages from that site; if all the pages use the same meta description, you end up showing the same description for all 8 pages.

So I prefer to index from non-meta sources. If a web page uses a meta description, all the search engines (Google, Yahoo, MSN, etc.) will show the same result (description) for that site, and I think a search engine needs to show a different description for a site than the other engines do. A description collected from the meta tag means the same result everywhere. Just try to collect the description from the page body. I know showing a description without any source code leaking in is very hard when you index from the page body, which is why more than 95% of search engines are meta engines, not like Google.

Back in March 2008, when my crawler indexed a page from the body, lots of source code turned up in the description. But now I can guarantee that where a major search engine shows source code in the description for a particular page despite that page having plenty of content, my crawler will show a description free of source code 99.99% of the time. I am building a whiteboard where anyone can see the page index. My crawler can index a page from both meta and non-meta pages, so now I don't mind whether a page uses a meta description or not. As long as my crawler gets a minimum of 20+ characters on a page (anchor text included), that's sufficient for the description of that page. A sketch of that body-based extraction is below.
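Roughly like this (a bare-bones sketch, assuming PHP's DOMDocument; the function name and the 160-character cap are arbitrary choices for illustration, only the 20-character minimum is the rule described above):

```php
<?php
// Bare-bones body-based description extractor. The function name and the
// 160-character cap are arbitrary; the 20-character minimum (anchor text
// included) is the rule described in the post above.

function extractBodyDescription($html, $maxLen = 160)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings on sloppy markup

    // Remove the nodes whose text would leak "source code" into results.
    foreach (['script', 'style', 'noscript'] as $tag) {
        $nodes = $doc->getElementsByTagName($tag);
        while ($node = $nodes->item(0)) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // textContent keeps anchor text, which counts toward the minimum.
    $text = trim(preg_replace('/\s+/', ' ', $body->textContent));

    // Fewer than 20 characters: not enough for a description.
    if (strlen($text) < 20) {
        return null;
    }
    return substr($text, 0, $maxLen);
}

// Usage: $desc = extractBodyDescription(file_get_contents('http://example.com/'));
```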
Heh, if only it were about just collecting meta tags. Of course, a search engine running off just meta tags would mean total spamnation and bad results.