"Majestic-12 is working towards creation of a World Wide Web search engine based on concepts of distributing workload in a similar fashion achieved by successful projects such as SETI@home and distributed.net." Project site: http://www.majestic12.co.uk/ Search site: http://majestic12.kicks-ass.org:8888/ This looks very promising. I like the idea, and it looks like this Alex guy knows what he is doing. M12: Results 1 - 10 of about 5,075,015 for search engine optimization (0.468 secs) Google: Results 1 - 100 of about 143,000,000 for search engine optimization. (0.60 seconds) Although M12 has "only" 1B pages in its index for now, it could probably compete with the big players in the future, with the help of the community, of course.
I tried a more complicated search. The results: Google: Results 1 - 10 of about 238,000 for lentil market news. (0.35 seconds) Majestic: Results 1 - 10 of about 4,868 for lentil market news (0.343 secs) This is a very impressive speed comparison! However, the relevancy was questionable. This was especially true of the "Best Relevancy" search option. The first result was for MSN.com, which does not offer lentil market news. The second result was for Amber Mac. None of the top results were relevant. The default search mode returned one relevant result, the fourth, while "alexc's magic recipe" also returned one -- the first. Google offers no choice of search type, and all of its top 10 results were relevant. This site and the software behind it clearly have potential. But their algorithms need work.
One of the reasons for this could be that there are "only" 1 billion pages in the index (Google has circa 9 billion). You should try comparing more general search results... I know it is a long way from being perfect, but it looks very promising. Fast search over 1 billion pages with his setup - picture... Can you imagine L. Page standing next to these servers?
I agree that being one-ninth the size will have an impact on the relevancy of results. On the other hand, I know that a site which covers the terms I tested is indexed by Majestic. I am unsure how they will tweak their algorithms to improve results. It is a very interesting and fascinating area. It should probably take into account how people respond to the results. I recall a discussion when Google was developing its algorithms wherein it was said the response of the searcher was as important in determining relevancy, and therefore the page rank of sites, as the weight of the terms in the site itself. I took a very simple approach in the search engine I wrote for my site. It has almost 200,000 documents in the index and uses a very simple weighting algorithm -- how many times did each word in the search appear in the document. It works on a single site because the documents are already relevant. A better search algorithm presumably considers word placement: how close do the terms appear to one another, given the order of the terms in the search? And so forth. His project has awesome prospects and amazing potential as long as people remain keen to get involved, offering their unused bandwidth and CPU cycles. But, once indexed, what next for the internet? That is the question Google et al struggle to answer.
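To make the weighting idea above concrete, here is a minimal sketch in Python. It is not the code from my site and not how Majestic or Google rank pages; the function names, the adjacency-based proximity bonus, and the `proximity_weight` parameter are all assumptions made up for illustration. It counts how often each search word appears in a document, then adds a bonus whenever two consecutive search words appear next to each other in the same order.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency_score(query, document):
    """Sum of how many times each query word occurs in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in tokenize(query))


def proximity_bonus(query, document):
    """Count spots where two consecutive query words sit side by side,
    in query order, inside the document (a hypothetical placement signal)."""
    q = tokenize(query)
    d = tokenize(document)
    bonus = 0
    for first, second in zip(q, q[1:]):
        for a, b in zip(d, d[1:]):
            if a == first and b == second:
                bonus += 1
    return bonus


def rank(query, documents, proximity_weight=2):
    """Return document names ordered by score, dropping zero-score documents.

    `documents` maps a name to its text; `proximity_weight` is an
    arbitrary illustrative value, not a tuned constant.
    """
    scored = []
    for name, text in documents.items():
        score = (term_frequency_score(query, text)
                 + proximity_weight * proximity_bonus(query, text))
        scored.append((score, name))
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0]


docs = {
    "a": "lentil market news: lentil prices rose",
    "b": "market for news about the lentil",
    "c": "weather report",
}
ranking = rank("lentil market news", docs)
```

With plain word counts alone, documents "a" and "b" would score 4 and 3; the placement bonus widens the gap because only "a" contains the words in the order searched.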
Hi all! Very interesting comments, thanks for that! The user-submission tool was sadly switched off because it seemed to be leaking memory, so in the interest of the server's stability I had to disable it; this is fixed now and the tool is back on. Relevancy: this is a loaded topic, but no doubt the most important one. It is obvious that our relevancy for multi-word searches is poor - the "Best relevancy" ranking algorithm is a user-defined one (users can change how ranking works in our search engine) that was named somewhat ambitiously. Most of the work up until now focussed on scaling the client and the search engine - now we have got code that is reasonably fast with 1 bln pages and can scale easily by adding more machines to the cluster. Performance improvements will be added shortly - we have barely started working on it! What you see now is the work of 5 months - all the other time was put into making sure the client works well; strange as it sounds, running a crawler outside the lab conditions of a dedicated network is not easy, but it's solved. So, lots more work is needed, and it will happen - check us out in a few months and I am sure you will appreciate the improvements in relevancy and speed, and more pages indexed of course - you can't find a black cat in a dark room when the cat has escaped regards, alexc Majestic-12
On the other hand, you do not know whether or not there is a cat in the room until the moment of checking. If it has escaped, you do not know whether it was there in the first place. I like the work you have done so far. I see the potential for some good alternatives to what is already out there.
Good work there. But as everyone said, relevancy needs a little work. Maybe when more pages come in there will be better results to show. Why not start by indexing pages linked from important sites, like pages in the Google Directory, DMOZ and other important directories? regards jeet
Thanks for popping in man... Have you got people checking on this or anyone stress testing it for spam?
Thanks guys, kind words much appreciated. A lot of the poor relevancy now is due to the rather primitive way multi-word searches are ranked; I've got some ideas on how to change that and expect big improvements starting later this month. Important sites are going to get crawled and indexed separately - more frequently, to give the search engine a "fresh" feeling. Spam is a very loaded topic; some work was done in this direction but a lot more will be added - particularly to catch those nasty sites with junk keywords and sneaky JavaScript redirects elsewhere.
I see your crawler coming through my site all the time. I can't imagine how this would succeed where the Grub project did not. Does anyone see any search results from Majestic, or is it still just collecting links?
I used it and I am impressed with the results and, of course, with the community. I want to congratulate you on your work, and I will try to crawl as soon as I can.
Hi, sorry for not responding earlier - I do not check this forum regularly (I only came because I received a PM). We have got a search engine up and running and it will be expanded greatly in the next few months - this is not an easy project from pretty much all points of view, but I think we are already more successful than Grub and will continue to be so in 2007 and beyond.