I always enjoy your posts Owlcroft, but in this instance I am going to take exception to the idea that you can rank five billion pages somewhat accurately based solely on the content of the pages.

Point number one. Let's remember that the objective of any search engine is to provide (in ranked order) relevant pages to the searchers. Now let's look at that term "relevant". Relevant does not mean the page that most accurately fits a prescribed algorithm; it means a page that is relevant to the USER, and that is a factor that simply cannot be determined by comparing the words on one page with the words on another. Relevancy in the eyes of the searcher means a page that is of the most use to the searcher with regard to his present search, and he expects that the most useful page will reside at the top of the list. There are so many things that make a page useful to the searcher, including the business proposition, the variety of sources the page provides, the reputation of the page's writer, the accuracy of the content, etc., and none of these can be determined by inspection (no matter how many times or by how many filters) of the words on the pages alone.

Point number two. Looking at it strictly from the viewpoint of an algorithmic comparison, how do you analyze a group of, say, 1.5 million pages, all on the same topic and each limited to perhaps a thousand words, to determine which is the most relevant based on page content alone? Take two identical pages and add the search term to one of them once more: it will surely rank better, but does that actually make it more relevant? One page uses a bolded header and the other does not: does that really make it more relevant? This is where on-page algorithmic ranking breaks down, and it is the reason to bring links into the equation, as they are seen as a human judgement of a relevant page. Yes, it is subject to spamming, but so is on-page text. Yes, it is subject to algorithmic manipulation, but so is on-page text. Adding links to the equation is just like adding those additional filters you were talking about: none of them are perfect, but each helps a little bit.

Point three. While perhaps the ideal search engine would have all the really relevant pages at the top of the list with none omitted, the searcher (remember them?) really does not care about this, so long as he finds several useful pages at the top of the list. There is no such thing as a perfect page or a perfect search engine algorithm, but so long as they adequately serve the needs of the searcher, they are successful. This is why it is not, IMO, necessary to worry about cloaking, for example, so long as the results presented to the searchers are useful. If the page owner resorts to this trick or that to ensure that his page is presented to the searchers first, the average searcher neither knows nor cares, so long as the page is useful. While I agree that, technologically, search engines could possibly provide much better search results, they are run as a business, and so long as the searchers are happy with what they produce there is no financial incentive to provide results which cater more closely to some ethical ideal. The search engine which the majority of searchers find provides them with the most useful results will be successful.
Sure. I was just pondering that this afternoon. But you miss a crucial aspect: what would be the point? I spam the SEs with a nonsense page. I even make the nonsense well enough to pass several filters designed to find nonsense (lack of grammatical relations, lack of typical accompanying words, whatever else the Phuds can dream up). It gets ranked highly as relevant to, oh, hair shampoo. And? Again: what would be the point? An exploit, to show it could be done? Big whoop. Right now, people spam because spamming a links-based SERP scheme brings visitors to a meaningful page from which the spammer sells something. It can also generate spurious PR to pass on to other pages. Getting a nonsense page ranked high on some search term accomplishes nothing whatever to repay the effort required.

You say so; I disagree. This is a matter of the application of intelligence, but making "expert systems" is scarcely an impossibility. Clever folk do it every day. Give me a generous budget for salaries and six months, and I'll have it in hand, GAR-ON-TEED.

I don't quote point #2 because it says the same thing as point #1: "I don't think it can be done." We must just agree to disagree. But I think you go astray by assuming, as it appears you have, that the relevance would be determined by such simplistic things as a sheer count of words. The grace a real SE database allows one is the ability to construct, by examining data, patterns of all sorts. One could--and I am just making this up on the fly--examine some modest-sized set of human-approved pages, such as dmoz, to abstract sets of words or phrases typically associated one with another on a page that is meaningful for one of those terms. One could use that established set to compare all sorts of factors. One does not need to "invent" from scratch a concept of meaningful--that work has already been done--one needs only to abstract the common qualities of meaningful as opposed to junk pages on a topic. Not simple--I never said simple--but certainly do-able, with a substantially better success rate than the current trash method.

Well, that's the point, isn't it? They don't. If you trouble to read the entire post, you will see that I concur that the top results are usually more or less relevant. But results at least as relevant, and often more so than many of those at the top, are buried deep, deep into the pages. If you are looking for one particular datum, the current system is probably adequate. If you forget who wrote Hamlet, enter it and you will soon find out. But if you are seriously seeking literary analyses of Hamlet, kiss it off unless you are prepared to spend hour after hour slogging through the first 700 or 900 responses, because if you don't, you may well--probably will--miss some important ones. So which searcher are the SEs serving the needs of? Certainly not the person who wants expansive information on a topic. In reality, many mediocre systems could produce top-20 results about as good as G's--in fact, most of the others probably do on most topics. What an achievement. It is in assuring the user that the top 20 or 30 hits are not simply "relevant", but are highly likely to be the 20 or 30 most relevant hits, that G and the others fail, fail miserably I would say.
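(A minimal sketch of the pattern-abstraction idea above, assuming a small hand-collected set of "approved" pages stands in for dmoz; the function names, window size, and scoring rule are invented for illustration and are not anyone's actual algorithm:)

```python
# Sketch: abstract the words typically associated with a topic term on
# human-approved pages, then score candidate pages against that profile.
# Everything here is illustrative, not a description of any real engine.
import re
from collections import Counter

def term_profile(approved_pages, topic_word, window=10):
    """Count words appearing within `window` words of `topic_word`
    across a set of human-approved pages."""
    profile = Counter()
    for text in approved_pages:
        words = re.findall(r"[a-z']+", text.lower())
        for pos in (i for i, w in enumerate(words) if w == topic_word):
            lo, hi = max(0, pos - window), pos + window + 1
            for w in words[lo:hi]:
                if w != topic_word:
                    profile[w] += 1
    return profile

def profile_score(page_text, profile):
    """Score a candidate page by how much of the approved-page
    vocabulary for the topic it actually uses."""
    words = set(re.findall(r"[a-z']+", page_text.lower()))
    total = sum(profile.values()) or 1
    return sum(cnt for w, cnt in profile.items() if w in words) / total
```

A real system would obviously add stemming, synonym handling, and far larger corpora; this only shows the shape of the comparison being proposed.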
No Eric, I think you have missed the point entirely, and that is that those things which are truly most useful to a searcher cannot be found on the page, and different searchers for the same term may have different requirements which no "expert system" can as yet determine (and IMO may never be able to). Here are a few that I listed quickly in my last post:

- the business proposition
- the variety of sources the page provides
- the reputation of the page's writer
- the accuracy of the content

While you blithely say that you can devise a search engine that will do all this, Google have had that generous salary, 150 PhDs and several years, and have not been able to do it. I would be very interested to know how you propose to out-Google Google, so some specifics please?

I am pretty much fully aware of all the various methods in use to determine on-page relevancy, Eric, but any serious researcher can see that there are limitations to what can be accomplished by analyzing on-page content. Please explain to me how you can use any technology to examine the on-page content and determine whether this or that page is more relevant in terms of any of the following:

- the business proposition
- the variety of sources the page provides
- the reputation of the page's writer
- the accuracy of the content

etc., etc. And once you do make such a determination for one searcher, how about the next searcher for the same term who has different requirements? Next do the math on 1.5 million pages (ostensibly on the same topic) and tell me how on earth you can determine which is the most relevant for a searcher (remember, I am not saying relevant to an algorithm)?

As a simple example, Eric, do a Google search for search engine optimization specifics and tell me how, by on-page analysis, you can determine which page of those 17,700 pages has the most accurate information? I believe that the difference is that, like the search engines, you are tending to emphasize the technical aspects while tending to ignore the needs of the searcher.
Owlcroft, Phud is PhD, right? That threw me for a second; I went on Google and couldn't find anything. t.
Hi Mel, I still think you are wrong about this, but you seem to have at least done a lot of research. If you could post a link collection / useful search-words list supporting your claim, or for doing the reading to at least decide one way or the other, that would probably be very useful. I have academic papers in mind, that sort of stuff. thomas. <Note on later edit: sorry, I meant Owlcroft, not Mel... whoops...>
Hi Thomas, you can start with the subjects in my last post. Do the Google search for search engine optimization specifics and you will be rewarded with only 17,700 pages, a paltry number as searches in major engines go. Now devise an automated system for determining which of those pages has the most accurate information regarding search engine optimization specifics, and I think that shortly you will start to see that there is no way you can get that information from the on-page data available. OK, broaden the conditions and use all the data in the database to determine which page is the most accurate, and I think you will still see that there is no way to rank pages on those terms for accuracy without reference to some outside help.

That is where links come in: to help you in your task of determining which of those is the most accurate for that search term, by using the research and thinking done by thousands of interested individuals. Yes, there is great potential for spam, but the results will be better than if you use the on-page data alone, which is also subject to spam.

If you want to research the subject there are numerous sources (but you have to determine the relevancy yourself). Start with the original design papers for Google by searching for "anatomy of a large scale search engine". Then search for patents assigned to Google to get some detail on the kind of thinking that is going into the use of links, and for a sample of others' thinking try searching for "hilltop", "localrank" and "topic specific page rank". The results of these searches should also provide many more links. Have fun.
Mel, thanks! Actually, I think you and I more or less agree on this topic. I meant to ask Owlcroft / Eric. But that was a good answer anyway. Thomas.
OK, we understand that you disagree. Stating your disagreement at length does not make it more or less of a disagreement. If you arrogate to yourself the privilege of simply declaring ex cathedra that "there is no way you can get that information from the onpage data", then I will do the same and say "Yes, there is." OK? Duelling unsupported claims.

I haven't got all day for this post, but here are a few thoughts, more or less at random, off the top of my head. You have a database of billions of pages. You can determine many things from it. For one, you can deduce with high accuracy the variants and synonyms of words--that "car" and "auto" and "automobile" typically designate the same thing. You can make a "connect level" list that shows words typically found in both near and far placement to certain terms, with a weight assigned based on frequency of co-appearance and typical distance from the word, and you can apply the synonym list in making this list. Now you can take a selected list, perhaps dmoz, and compare things such as just mentioned--and many others that would only occur after some time and thought and effort--and see how, for "approved" pages, they differ from the web norm for pages on which a given topic is mentioned. Indeed, in general, you can play around with the known (and large) set of "approved" pages to see just how, for as many yardsticks as you care to come up with, they differ from the web norm for pages mentioning a given word or phrase frequently. That's only a beginning.

I am not a Phud (yup, PhD), and do not claim that I, unaided, working in a garage, could solve all these issues. But it ought to be plain enough that they can be handled. There are computer expert systems that are significantly better than live doctors at diagnosing disease merely from "interviews" with the patients. You think this is a notably harder problem?

That's what I meant by "One does not need to 'invent' from scratch a concept of meaningful." In designing an on-page-relevance algo, you have an established large set of pages already human-judged to be both relevant and of good quality. You can keep trying this and that as parameters till you see which ones can essentially duplicate the human-made selections. (In fact, this is a matter that virtually begs for application of neural-network technology, which renders nearly trivial the question of human cleverness in designing algos--that being, after all, what a neural net does: use feedback and "guesses" to eventually produce a filter that will replicate a certain established set of results over a large database, so that the filter can then be applied to new data not in the database.)

Why doesn't Google, with its gaggle of Phuds, do this right now? Because they are fixated on "links as votes". You might as well ask why Detroit (and Japan) keep on producing cars with internal-combustion engines instead of the inherently superior external-combustion engine. (Also, for all we know, they are working on it. Despite the mass of available evidence to the contrary, maybe they really aren't stupid.)
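(To make the "connect level" idea concrete, here is a rough sketch under stated assumptions: the 1/distance weighting and the normalized comparison of "approved" pages against the web norm are guesses at one plausible formulation, not a description of any engine's method:)

```python
# Sketch: weight co-occurring words by frequency and closeness to a term,
# then see how an "approved" set (dmoz-style picks) departs from the web
# norm. Weighting and normalization are illustrative assumptions.
import re
from collections import defaultdict

def connect_levels(pages, term, max_dist=25):
    """Weight each co-occurring word by 1/distance, summed over every
    appearance near `term` in the given pages."""
    weights = defaultdict(float)
    for text in pages:
        words = re.findall(r"[a-z']+", text.lower())
        for pos in (i for i, w in enumerate(words) if w == term):
            for j in range(max(0, pos - max_dist), min(len(words), pos + max_dist + 1)):
                if j != pos and words[j] != term:
                    weights[words[j]] += 1.0 / abs(j - pos)
    return weights

def divergence(approved_weights, norm_weights):
    """Normalize both profiles and subtract: large positive values mark
    words that distinguish approved pages from the web norm."""
    a_total = sum(approved_weights.values()) or 1.0
    n_total = sum(norm_weights.values()) or 1.0
    vocab = set(approved_weights) | set(norm_weights)
    return {w: approved_weights.get(w, 0.0) / a_total - norm_weights.get(w, 0.0) / n_total
            for w in vocab}
```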
Most who have studied the issue know that PageRank is outdated technology, as Owlcroft says; computer science experts know this. Owlcroft is correct when he says the Google system is flawed. Google knows this, but the whole system is built around the marketing of "PageRank", using the toolbar as a data collection and marketing device. So you are not going to throw your flawed technology out the door when your cash flow depends on your flawed technology. Remember, as long as the customer never complains, it is considered a commercially acceptable product. It is still making money; when it quits making money or is outlawed, Google will change.
IMO the type of system you suggest is not much better than on-page analysis, simply because you are reducing thousands of pages to an aggregate average and then saying that pages which use words in the way that you say are better than those which do not. Suppose I write the world's most authoritative page on a particular subject, using new concepts and definitions: how is your proposed system going to know that this is the definitive page on the subject? The answer is that it is not. You keep asserting that it can be done, OWL, but you do not demonstrate that this is a fact as opposed to supposition. If it were a fact that this simple technology would solve the relevancy problem, please explain why some clever fellow has not done it. There are literally dozens of new search engines started every year; why do none of them use this simple process?
Let me throw the ball deep into center field. While I think Google does the best job at this moment, IMO links are already doomed to be downrated in the future as a relevancy factor, just as PageRank and on-page factors have been. Why? They can all be manipulated, and they are not a reflection of the searcher's perception of relevancy but a reflection of a webmaster's ability to increase on-page factors, PageRank and/or links.

The only relevance factor I would consider a real one should come from the searcher, just like we can cast votes on each other's reputation in this forum. The main problem here would be the incentive to implement and cast the vote; I don't think finding an incentive for the searchers will be a major problem, but finding an incentive for the search engines may prove a larger one. The algorithmic implementation should not be that complex; the technical requirements of bandwidth, serving time and storage may prove more troublesome. At this moment Google already keeps a tally of search clicks and has systems to prevent multiple clicking by one user (AdWords/AdSense), but why change a winning combination? The competition poses no threat at this moment.

The model they are working with at this moment is a lucrative one. A drawback is that they (indirectly) create their own spam whenever their algos are reverse-engineered (always; nothing beats a PhD like a highly motivated webmaster), but the gravest drawback is the static top-30 rankings, which mean that not all relevant results are shown to searchers; a static top position can, in my opinion, only show all relevant results within a small subset. Finding relevant results at position 800 is just the logical result of having 2,000,000 pages for a particular term: if only 0.1% is relevant then you will have 2,000 relevant pages, and of these Google will select the "top" 1,000 to show according to its algorithm.

The challenge we face as webmasters is getting into the top 30 or 50 that searchers will review, and possibly past all the "spam", using the SEs' own papers, some simple testing, and staying (barely) within their guidelines. Having said all of this, at this moment my greatest effort will go into SE-friendly links, more links, and site size.
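(For what it's worth, the bookkeeping side of searcher voting is indeed simple; a toy sketch, with invented names and an in-memory store standing in for whatever a real engine would actually use:)

```python
# Sketch: record one click-vote per (user, query, result) so repeat clicks
# from the same user do not inflate the tally. Illustrative only; storage,
# identifiers, and anti-abuse measures in a real engine would differ.
from collections import defaultdict

class ClickVotes:
    def __init__(self):
        self._seen = set()              # (user_id, query, url) already counted
        self._tally = defaultdict(int)  # (query, url) -> vote count

    def record_click(self, user_id, query, url):
        key = (user_id, query, url)
        if key in self._seen:
            return False                # same user, same result: ignore
        self._seen.add(key)
        self._tally[(query, url)] += 1
        return True

    def votes(self, query, url):
        return self._tally[(query, url)]
```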
I really want to keep dialogues here civil, even when it is hard. May I suggest RTFP? Where did I say it was "simple"? I believe any reasonable reading of my words will suggest the opposite. But there is a world of difference between "complicated" and "impossible or nearly so".

It is on-page analysis. That is the whole point. But the analysis uses, as part of its methodology, what real, live humans have found in related or parallel cases to be useful, relevant, well-done content. (I'm not convinced that doing it by such comparison is necessary, but it makes a nice, straightforward starting point.) Science consists precisely in reducing thousands of diverse data to a standard form that fits them all: such a form is called a law or principle. Deriving laws and principles from diverse data is what science is all about.

Moreover, why do you feel a need to repetitively ask why it is not being done? It is not being done because G does not do it that way, just as GM does not make steamer cars, and everyone else feels that to beat G you have to out-G them. Eventually, someone will do it the smart way--that is something you can bet on, save that the time frame is unknowable--and then they will be the new G. But this is more complicated than just thinking up the magic phrase "links are votes" and implementing it with some fairly straightforward math, so it's not going to be dreamt up by a couple of bright college kids. Someone with serious money will have to be willing to throw some of it at working out the better way in practical detail. (You do know what a neural network is, don't you?)
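(For anyone curious, that "fairly straightforward math" behind links-as-votes can be sketched as a textbook-style power iteration; the damping factor and iteration count below are the usual classroom defaults, not Google's production values:)

```python
# Sketch: basic PageRank-style power iteration over a tiny link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # dangling page: spread its vote evenly over all pages
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Example: three pages where A and B both "vote" for C, and C links back to A.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```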
Hi Eric, yes, I do know what a neural network is, but that is not what I see you suggesting be used to do on-page analysis. There are other search engines who do it differently; look at Teoma and Wisenut, for example.

The problem as I see it, Eric, is that so long as one is using on-page analysis, you simply cannot infer more than what is written on the page. A human, on the other hand, may use his own knowledge and research to review others' work and to compare it, and then he may even come to the conclusion that even though Einstein's English and writing were not all that good, his ideas were great. This is the process, so far limited to humans, called thinking. Do you believe that a neural network can do that?

That is one reason to use human input whenever and however possible, even though it may be subject to spam, and even though it may not be perfect. Another reason is that of all the wonderful things that humans have invented, there is nothing that understands a human better than another human, and search engines are built for humans. Google have shown that the money is there if you can do it only half right, so why are the investors not jumping on the bandwagon if the methodology exists and is readily available?
I think that the PR model for ranking sites is bad. It's very easy (if you have the money) to place sites very high in Google; anyone can buy PR. I think that the "next thing" (or engine) will give PR less importance, not as much as G does.
"Do you believe that a neural network can do that?" Yes. OK? "you simply cannot infer more than what is written on the page." What is that when it's at home with its shoes off? "Infer more than is written on the page"? I like to think I know my mother tongue passing well, but I can't make that mean anything. "why are the investors not jumping on the bandwagon if the methodology exists and is readily available?" Obviously, because no one has yet come to a sophisticated investor--this will take _big_ money--and said "Here's how it can be done, and here's why it will be better and succeed in a financial sense." Sooner or later someone will, and then everyone will say "Gee, how original!" Or maybe G itself will finally figure, when they feel hot breath on their collective neck, that it's time to move on and up. . . .
Well, actually, no, it's not OK, Eric, as I for one do not believe that neural networks are a suitable tool for analyzing web page content for relevancy. Neural networks have been developed for many things, but the only practical applications that seem to have been working are predictive models, where an analysis of large amounts of data from the past enables some prediction of future trends, and pattern matching, including things like character recognition.

If I understand what you are saying, you are suggesting that we take some already established body of data which humans have established is relevant (what that might be I do not know, since directories are not ranked for relevancy) and, based on that, look for similar patterns in other pages to predict their relevancy. The problem, of course, is that there is no such all-inclusive body of data available, and that no one I know of has ever used a neural network for the methods you propose.

The reference to inferring more than is written on the page is a reference to attempting to determine things which are not written on the page from what is written on the page, like the accuracy of the information, the viability of the business proposition, the reputation of the author, etc., my point being that to determine the true relevancy of a page you have to know more than just what is written on the page. Sorry if you could not understand that. But let's just agree to disagree, since this is not a productive discussion IMO.
Aside from neural networks not being an essential component of the concept, there remains the point that the description above sounds remarkably like what one would write out were one asked to describe the task to be done in determining web-page relevance. The "large amounts of data from the past" would be the dmoz results, or something like them (if there is anything else). Neural networks have been used to predict the number of runs a baseball team will score in a season, and have been pretty successful. They are scarcely restricted to trivial or "drone-level" mechanical work. Feed a sufficiently powerful n.n. a large mass of roughly contemporaneous web pages, identify which ones dmoz selected, and let it rip.

I still don't understand it. I am not sure you realize the depth of the computational (and logical) power that can be brought to bear when you have the entire web (present and past) available as a database, one which you can analyze to extract relevant off-page information. If a page mentions Albert Einstein, data about Einstein, in a form parsable by the system, will be available and will suffice to distinguish him (especially in a given page context) from Mileva Einstein or Albert Schweitzer as far as significance to, say, the music of Mozart. The one-time set-up effort of a methodology like this is huge (but practicable), but after that its operation is incremental, as data are added to the web's total.

If you think that "links are votes" is doing a swell job, fine; if not, what's your alternative? And that's about me for this topic.
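(As the crudest possible stand-in for the neural-net filter described above, a single-layer model can be trained to reproduce human selections from bag-of-words page features; everything here, from the feature choice to the learning rate, is an assumption for illustration only, not a claim about how dmoz data would really be used:)

```python
# Sketch: train the simplest possible "filter" to replicate human
# (dmoz-style) approvals from page features, then apply it to new pages.
import numpy as np

def train_selector(features, labels, lr=0.1, epochs=500):
    """features: (n_pages, n_features) array of page features;
    labels: 1 if a human editor approved the page, else 0.
    Returns learned weights and bias (logistic regression)."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        z = features @ w + b
        p = 1.0 / (1.0 + np.exp(-z))            # predicted approval probability
        grad_w = features.T @ (p - labels) / n  # logistic-loss gradient
        grad_b = float(np.mean(p - labels))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def approval_score(page_features, w, b):
    """Score a new page the way the trained filter would."""
    return 1.0 / (1.0 + np.exp(-(page_features @ w + b)))
```

A real attempt would of course use a genuinely multi-layer network and far richer features; the point of the sketch is only that "replicate an established set of human judgements, then apply the filter to new data" is an ordinary supervised-learning setup.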
I don't know if you guys (Owlcroft and Mel) can make any sense out of the link below, but I hope it helps; if not, there are other studies linked from the same website. You may have heard of these guys before: http://research.microsoft.com/research/pubs/view.aspx?tr_id=690