Hi, what are the scripts to run your own crawler? I've found a PHP/MySQL one, but it doesn't allow visitors to submit their sites for crawling. Is it really a complicated process to build your own unique search engine? Thanks, regards
Here are some open source projects:
- http://lucene.apache.org/nutch/
- http://grub.org/ (also used by Wikia.com)
- http://code.google.com/p/gungho-crawler/wiki/Index
- http://java-source.net/open-source/crawlers
- http://www.searchtools.com/robots/robot-code.html
- http://www.cs.cmu.edu/~rcm/websphinx/
- http://www.noviway.com/Code/Web-Crawler.aspx

Now, the hard part is choosing one of them ...
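If you just want to see what's actually involved before committing to one of those projects, the core of any crawler is a small fetch/parse/enqueue loop. Below is a minimal, purely illustrative Python sketch (not taken from any of the projects above); the seed URL and page limit are made up, and a real crawler adds robots.txt handling, politeness delays, deduplication and an index on top of this. Visitor-submitted sites would simply be pushed onto the same queue.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl from seed_urls, yielding (url, html) pairs."""
    queue = deque(seed_urls)   # visitor-submitted URLs would be appended here too
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue  # skip pages that fail to download
        fetched += 1
        # Naive link extraction; a real crawler would use a proper HTML parser.
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
        yield url, html  # hand the page off to whatever builds your index

if __name__ == "__main__":
    for page_url, _ in crawl(["http://example.com/"]):
        print("fetched", page_url)
```

The hard parts that make those open source projects worth using are everything this sketch leaves out: crawl scheduling across millions of hosts, storage, and ranking.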
Here's another one:
- http://spinn3r.com

-----------
Why write your own weblog spider when you can just use ours? Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup. Spinn3r handles all the difficult tasks of running a spider/crawler, including spam prevention, language categorization, ping indexing, and trust ranking.
-----------
They should concentrate on getting visitors and PR instead of advertising... In the US they had 1,000-something visitors/month (according to compete.com)... I guess they don't have more than 4,000-5,000 worldwide... for a search engine, that's a really small number...
TechCrunch just posted part of an article of mine about crawling the web from home: http://www.techcrunch.com/2008/05/23/therarestwords-intriguing-semantic-seo-project-from-russia/
I'm not sure what your end goal is, but if you're trying to build a specialized search engine that searches only specific sites, you can try Google Custom Search, which restricts results to the sites you specify.
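If you later want to query such a custom engine programmatically rather than embedding Google's search box, there is also a JSON API for it. A minimal sketch, assuming the Custom Search JSON API endpoint and its key/cx/q parameters; the API key and engine id below are placeholders you get from Google's developer console and the Custom Search control panel:

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"    # placeholder: issued in the Google developer console
ENGINE_ID = "YOUR_CX_ID"    # placeholder: the "cx" id of your custom search engine

def custom_search(query):
    """Return (title, link) pairs from a site-restricted custom search engine."""
    params = urllib.parse.urlencode({"key": API_KEY, "cx": ENGINE_ID, "q": query})
    url = "https://www.googleapis.com/customsearch/v1?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [(item["title"], item["link"]) for item in data.get("items", [])]

if __name__ == "__main__":
    for title, link in custom_search("web crawler"):
        print(title, "-", link)
```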