Nutch is an open source software project that provides a platform for new engines to build on. Check out the following: http://labs.yahoo.com/demo/nutch/ http://www.objectssearch.com http://www.mozdex.com These are all new OPEN Source engines using the www.Nutch.org platform.
Ha-ha yeah, no tricking necessary, you'd know the algo by looking at the source. Let's hope Google goes open source.
Yahoo is looking at it for a reason; you will just have to study the issues. Again, I think that a company like IBM, with all of its supercomputing and database software technology, will win out in the end. In the meantime, a lot of players will try a lot of new things. It is just a matter of time: search will become a commodity that can be purchased from IBM wholesale, and then small companies will add value and their own spin on that data and create an end-user experience with their flavor. Just think, we could have "MasteroftheUniversesearch.com" powered by IBM. Now, you know a programmer with Shawn's skills could pull it off; I just wonder what flavor he would add to the search function? Or a programmer like Shawn, or anyone else, can download the www.nutch.org open source code for free and start their own engine. Who knows who will win the SearchEngineWars.whatever; in the end only those with the most resources win, and IBM has the most resources.
Open source search engines (or even ones that are licensed with the algorithms) will never truly be competitive for relevant results, precisely because they ARE open source. That means SEO people can get into the guts of them and see exactly what they consider important for relevancy, so they are wide open to SEO. It would be the same as if Google publicly disclosed their ranking algorithms: instantly you would see a lot more crap at the top of the results, because people would know exactly what the search engine deems important.
You think IBM could possibly pull a search engine out of their hat? That's an interesting thought. However, Microsoft has similar-scale resources and may pull it off. Then again, Yahoo! and Google have the power of their brands on their side.
IBM was involved in search when Larry and Sergey were 10 years old; read the thread here at Digital Point on "IBMtheKingofSearch?" and look at the articles linked from there. Shawn, below is an interview with the programmer who is one of the creators of Nutch; you guys talk the same talk, two peas in a pod! http://blog.outer-court.com/archive/2004_05_28_index.html#108573025728740424
It's true that it might be easier to manipulate the rankings if you have the source code. This is what nutch.org has to say in their FAQ:

Won't open source just make it easier for sites to manipulate rankings? Search engines work hard to construct ranking algorithms that are immune to manipulation. Search engine optimizers still manage to reverse-engineer the ranking algorithms used by search engines, and improve the ranking of their pages. For example, many sites use link farms to manipulate search engines' link-based ranking algorithms, and search engines retaliate by improving their link-based algorithms to neutralize the effect of link farms. With an open-source search engine, this will still happen, just out in the open. This is analogous to encryption and virus protection software. In the long term, making such algorithms open source makes them stronger, as more people can examine the source code to find flaws and suggest improvements. Thus we believe that an open source search engine has the potential to better resist manipulation of its rankings.

---- End of Quote ----

I think the idea is that the ranking algorithm will be adapted so quickly that the SE may be able to win the cat-and-mouse game against SEOs because so many programmers will be writing the code. The main idea behind a search engine is of course to eventually return the pages that a human would deem most relevant if he/she had read every single page on the net and personally answered the query.

Please note that the users of the open source code can tweak the importance that the search engine puts on certain aspects, and they don't have to make that public. So you're far from knowing exactly what will make your page rank highly. You already know that most SEs look at headlines, titles, keyword density, link popularity, etc. You just don't know the weight of each of these aspects.

Christian
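To make Christian's point about private weights concrete, here is a rough sketch of how a configurable scoring function might combine those factors. This is my own toy illustration, not Nutch's actual scoring code; the factor names and weight values are invented for the example. The point is that even with the source published, the weights an operator actually deploys can stay private.

// Hypothetical illustration only -- not Nutch's real scoring code.
// Shows how an operator could privately re-weight ranking factors.
import java.util.HashMap;
import java.util.Map;

public class TweakableScorer {

    // Per-factor weights the site operator can change without publishing them.
    private final Map<String, Double> weights = new HashMap<>();

    public TweakableScorer() {
        // Default weights (made-up numbers, purely illustrative).
        weights.put("titleMatch", 2.0);
        weights.put("headlineMatch", 1.5);
        weights.put("keywordDensity", 0.5);
        weights.put("linkPopularity", 3.0);
    }

    // Each page is reduced to normalized feature scores in [0, 1].
    public double score(Map<String, Double> pageFeatures) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            total += w.getValue() * pageFeatures.getOrDefault(w.getKey(), 0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        TweakableScorer scorer = new TweakableScorer();
        Map<String, Double> page = new HashMap<>();
        page.put("titleMatch", 1.0);
        page.put("keywordDensity", 0.3);
        page.put("linkPopularity", 0.8);
        System.out.println("score = " + scorer.score(page));
    }
}

Reading the source tells an optimizer which factors exist, but not the numbers in that map on any particular installation, which is exactly the "you know the aspects, not the weights" situation Christian describes.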
As with the "Hilltop algo" thread, clearly the gold standard is an algo that cannot effectively be spammed. That may sound at first blush like an oxymoron, but not necessarily. In any event, the clear point of the OS advocates is that if we don't try, we'll never know. What this or that person or group may not be able to come up with, the entire programming world may. Clearly, a sufficiently large panel of intelligent humans could--if not with ideal speed--be an absolutely "unspammable" form of "algorithm"; so the question boils down to "Can we make an 'expert system' sufficiently close to human judgement?" Has anyone yet done any work with neural networks? That looks to me like the most promising avenue of approach right now.
Owlcroft, you speak of AI; really, if computer search is converging with AI, which is still way off, many of us will not be around when it happens. You are right on when you speak of the goals of the SE executives in their search for better search. Industry experts have been saying the same things you just mentioned.
"Has anyone yet done any work with neural networks?" (quote by Owlcroft above) Who are these folks, Owlcroft? Can you tell us what they do and give a link for them? Thank you.
MUSHROOM, when you find a live link for that company he mentioned, "Neural Networks", let us all know, would you?
I have assumed--rightly? wrongly?--that everyone here knows what a neural network is and, at least in a broad-brush way, how such things work. This is scarcely science-fiction futurism: neural networks are doing useful work right now. (I don't believe anyone classes neural networks as "AI", which is a dubious concept. I personally think AI is, in the very long run, an achievable goal, but who am I to argue with Roger Penrose?) If none of the leading players in SE are now looking long and hard at neural networks, I am one surprised puppy.
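For anyone who wants a concrete picture of what a neural approach might look like here, below is a toy sketch I put together: a single sigmoid neuron that learns a relevance weighting from human yes/no judgements. It is purely illustrative, not anything any engine is known to run, and the features and training data are invented for the demo.

// Toy illustration of a neural approach to relevance -- not any engine's real code.
// One sigmoid neuron learns to weight page features from human relevance judgements.
public class ToyRelevanceNet {

    private final double[] w;        // one weight per feature
    private double bias = 0.0;
    private final double rate = 0.5; // learning rate

    public ToyRelevanceNet(int numFeatures) {
        w = new double[numFeatures];
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Predicted probability that a human would call the page relevant.
    public double predict(double[] features) {
        double sum = bias;
        for (int i = 0; i < w.length; i++) sum += w[i] * features[i];
        return sigmoid(sum);
    }

    // One gradient-descent step toward the human judgement (1 = relevant, 0 = not).
    public void train(double[] features, double judged) {
        double error = judged - predict(features);
        for (int i = 0; i < w.length; i++) w[i] += rate * error * features[i];
        bias += rate * error;
    }

    public static void main(String[] args) {
        // Features: [title match, keyword density, link popularity] -- invented for the demo.
        double[][] pages  = { {1.0, 0.2, 0.9}, {0.0, 0.9, 0.1}, {1.0, 0.1, 0.8}, {0.0, 0.8, 0.0} };
        double[]   judged = { 1, 0, 1, 0 };   // pretend human judgements

        ToyRelevanceNet net = new ToyRelevanceNet(3);
        for (int epoch = 0; epoch < 2000; epoch++)
            for (int p = 0; p < pages.length; p++)
                net.train(pages[p], judged[p]);

        System.out.println("keyword-stuffed, unlinked page: " + net.predict(new double[]{0.0, 0.9, 0.1}));
        System.out.println("well-linked on-topic page:      " + net.predict(new double[]{1.0, 0.2, 0.9}));
    }
}

The interesting property for the spam discussion is that the weights come out of human judgements rather than being hand-published constants, so an optimizer can only chase them indirectly.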
Where is all the processing power, and the data storage, going to come from? What popped into my head is a kazaa/skype-type thing that somehow interoperates with nutch. Users could determine their seeds, algos, and SEO spam filters for the results that their "node" is responsible for. Problem is, we don't really have a p2p database. And I don't know if authority-graph-type (aka PageRank) calculations can effectively be performed on a distributed network. But if they could... it would be cool! SEO spam in a zero-info regime is such a frustrating, stupid (if lucrative) problem to waste brainpower on! thomas. ps write it in perl!
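On thomas's question of whether authority-graph (PageRank-style) calculations could run on a distributed network: here is a bare-bones power-iteration sketch, my own toy and not Nutch or Google code, with a made-up four-page graph. Each pass only needs "how much rank flows along each link", and that inner loop is the part a p2p node could compute for its own slice of pages, shipping partial sums to whichever node owns the target pages between passes. (Sorry thomas, it came out in Java rather than perl.)

// Toy PageRank by power iteration -- illustration only, not Nutch or Google code.
public class ToyPageRank {

    public static void main(String[] args) {
        // links[i] = pages that page i links to (a tiny made-up web graph).
        int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
        int n = links.length;
        double damping = 0.85;

        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);

        for (int pass = 0; pass < 50; pass++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - damping) / n);

            // The distributable part: each p2p node could handle the outgoing
            // links of its own pages and send the partial contributions to
            // whoever owns the target pages.
            for (int page = 0; page < n; page++) {
                double share = damping * rank[page] / links[page].length;
                for (int target : links[page]) next[target] += share;
            }
            rank = next;
        }

        for (int page = 0; page < n; page++)
            System.out.println("page " + page + " rank " + rank[page]);
    }
}

Whether the rounds of partial-sum exchange converge fast enough over a kazaa-style network, and resist nodes that lie about their sums, is the open part of the question.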