Hi, I have been working on my own search engine, from scratch. It is a search engine, but with a twist (I'm not saying any more). Basically, is it possible to get another site's index and import it into mine, so it would save me years of indexing the entire internet myself? Thanks for reading!
It certainly is. For new startups it's important to realise you don't need to have the entire web indexed. You just need enough to get you started. Depending on your technology, you shouldn't have to import another party's index. For example, if your crawler can crawl and index by IP, then perhaps talk with some major web hosts (e.g. HostGator) and ask for their server IPs. Probably even better practice would be to index DMOZ and several other human-edited directories in the countries whose content you want to index. That will give you a great start and plenty of data to work with. Open up in alpha stage and offer a submit-URL form (again, depending on your tech, you may not need one), and before you know it you've got millions of pages indexed and enough to launch with. Remember, starting small means just that. No startup 'starts' by beating Google or indexing more sites than Google, and if you claim to do that (Cuil), there are often holes in your technology. Best to start with a good amount but focus on relevancy. You can always add servers at a later stage to handle more indexed pages, i.e. more data. You may also look at aggregating a search feed from a current search engine you respect (Big G, Live, Yahoo, searchii (shameless plug) or Ask), but give a higher rank weight to the results returned from your own index. Once you're happy with the number of sites you have indexed, remove the aggregated provider. Hope this helps.
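To make the aggregation idea a bit more concrete, here is a minimal sketch in Python of blending results from your own index with results from an external feed while weighting your own results higher. Everything in it is an assumption for illustration: the function name `blend_results`, the (url, score) input format with scores normalised to the range 0 to 1, and the default weights of 2.0 and 1.0. It does not call any real search engine's API; you would have to fetch and normalise the external results yourself.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (url, relevance score assumed to be in [0, 1])


def blend_results(
    own: List[Result],
    external: List[Result],
    own_weight: float = 2.0,       # hypothetical weight favouring your own index
    external_weight: float = 1.0,  # hypothetical weight for the aggregated feed
    limit: int = 10,
) -> List[Result]:
    """Merge two result lists into one ranking, favouring `own` results."""
    combined: Dict[str, float] = {}
    for url, score in own:
        combined[url] = combined.get(url, 0.0) + own_weight * score
    for url, score in external:
        # A URL found by both sources accumulates both weighted scores.
        combined[url] = combined.get(url, 0.0) + external_weight * score
    # Highest blended score first; ties are broken by URL for a stable order.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    own_hits = [("http://a.example/", 0.5)]
    feed_hits = [("http://b.example/", 0.9), ("http://a.example/", 0.2)]
    for url, score in blend_results(own_hits, feed_hits):
        print(f"{score:.2f}  {url}")
```

Removing the aggregated provider later then amounts to passing an empty list for `external`, so the rest of the ranking code would not need to change.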
Wouldn't ya have to check first that the indexes (indexii?) are compatible? I'd nearly assume that any search engine that isn't running from an off-the-shelf script will have a custom-coded index.
I had thought of this too... The index export should be a database, so I could then rename the tables and reorganise the information to suit my own schema. Does anyone here have a large web index? I would be willing to pay for a large index.
I think you're missing what an index is. Some indexes are not 'databases' in the SQL sense. They are indexes with their own format. Take a look at http://wiki.apache.org/hadoop/Hbase for example, or do a Google search for Lucene. I'm sure you'll find smaller indexes in the form of SQL databases, particularly if you're looking for niche databases, but if you're looking for something truly huge it's more likely you'll have to convert the data from somebody else's index into your SQL schema. And I don't know of any index that's accessible with PHP; most are Java, C++, etc. You could always search for Lucene forums and post a request on them; you might find somebody who's willing to sell their data.
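For a rough picture of what "converting somebody else's index into your SQL" could involve, here is a small Python sketch. It assumes you have already exported the foreign index into a plain term-to-URLs mapping (that export step depends entirely on the source format and is not shown), and it loads that mapping into two SQLite tables. The table names `documents` and `postings` and the functions `load_postings` and `lookup` are made up for this example; a real index would also carry positions, scores and other data that this ignores.

```python
import sqlite3
from typing import Dict, List


def load_postings(conn: sqlite3.Connection, postings: Dict[str, List[str]]) -> None:
    """Load an exported term -> [urls] mapping into two SQL tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "term TEXT NOT NULL, "
        "doc_id INTEGER NOT NULL REFERENCES documents(id), "
        "PRIMARY KEY (term, doc_id))"
    )
    for term, urls in postings.items():
        for url in urls:
            # Each URL is stored once; repeated URLs reuse the existing row.
            conn.execute("INSERT OR IGNORE INTO documents (url) VALUES (?)", (url,))
            doc_id = conn.execute(
                "SELECT id FROM documents WHERE url = ?", (url,)
            ).fetchone()[0]
            conn.execute(
                "INSERT OR IGNORE INTO postings (term, doc_id) VALUES (?, ?)",
                (term, doc_id),
            )
    conn.commit()


def lookup(conn: sqlite3.Connection, term: str) -> List[str]:
    """Return the URLs containing `term`, sorted alphabetically."""
    rows = conn.execute(
        "SELECT d.url FROM postings p JOIN documents d ON d.id = p.doc_id "
        "WHERE p.term = ? ORDER BY d.url",
        (term,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    exported = {
        "search": ["http://a.example/", "http://b.example/"],
        "engine": ["http://a.example/"],
    }
    load_postings(conn, exported)
    print(lookup(conn, "search"))
```

Row-by-row inserts like this are fine for a niche-sized index, but for anything truly huge you would likely want batched inserts and a database server rather than SQLite.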