Google "Sandbox" Threshold with "Hysteresis"


hexed
May 21st 2004, 11:27 pm
I've been doing some thinking and research on the entire sandbox effect, and here's a preliminary theory that just popped into my head. I also understand that there have been similar theories on the subject and that this is not completely new. However, I _believe_ (the key words being "I believe") that the theory presented below may be the most accurate of them to date. The main point here is _HYSTERESIS_. Please try to understand the concept. I know it may be tough if you've never heard the word before.

THIS IS JUST A THEORY AND HAS NOT BEEN PROVEN - PLEASE TREAT IT AS SUCH.

UNTESTED THEOREM:

Google applies a dampening effect to new sites specific to the keyword or keyword combination. The amount of dampening and the length of time it takes to be even listed is proportional to the amount of sites that are already listed in the SERPs for those specific keywords.
i. ranking threshold - is also applied (see below).
ii. gluttonous gathering - accumulation of links too quickly will have a global negative effect on the formula/theory (see below).

EXAMPLES:

CASE 1: Adding a new site or page to the Google SERPs where there are already 10 pages listed for "some odd keyword combination". This may only have a dampening effect of 10% and a time to be listed (TTBL) of 3 days.

CASE 2: Adding a new site or page to the Google SERPs where there are already 1,000,000 pages listed for "common keyword". This may have a large dampening effect of 99% and a time to be listed (TTBL) of 3 months.
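
A minimal sketch of how the two cases above could sit on one curve. The theory only gives the two example points, so the interpolation (linear in the logarithm of the number of listed pages), the function names, and the reading of "3 months" as 90 days are all assumptions made for illustration, not anything Google is known to do.

```python
import math

# Anchor points taken from CASE 1 and CASE 2 above:
# (pages already listed, dampening, time to be listed in days).
LOW = (10, 0.10, 3.0)
HIGH = (1_000_000, 0.99, 90.0)  # "3 months" assumed to be about 90 days


def _interpolate(pages_listed, low_value, high_value):
    """Interpolate linearly in log10(pages), clamped to the two anchor points."""
    span = math.log10(HIGH[0]) - math.log10(LOW[0])
    t = (math.log10(max(pages_listed, 1)) - math.log10(LOW[0])) / span
    t = min(max(t, 0.0), 1.0)
    return low_value + t * (high_value - low_value)


def dampening(pages_listed):
    """Hypothetical dampening factor (0.0 to 1.0) for a keyword combination."""
    return _interpolate(pages_listed, LOW[1], HIGH[1])


def time_to_be_listed_days(pages_listed):
    """Hypothetical TTBL in days for a keyword combination."""
    return _interpolate(pages_listed, LOW[2], HIGH[2])
```

With these assumptions a keyword with 1,000 competing pages lands between the two cases, at roughly 46% dampening and a TTBL of about 38 days.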

RANKING THRESHOLD (WITH HYSTERESIS):
To further complicate the problem, some sites may have to pass a "ranking threshold" to even be added into the SERPs. This would explain why some sites have been in the sandbox for over a year. I'm not talking about a threshold at the 10000th site in the SERPs for that keyword. I'm saying a site may have to pass the threshold of, say, the 50th site in the SERPs for that keyword. If there are any electrical engineers out there, the device called a "Schmitt Trigger (http://ourworld.compuserve.com/homepages/g_knott/elect344.htm)" immediately comes to mind. Also, a dynamic factor that seems to be at play in the SERPs is what engineers call "hysteresis (http://www.fact-index.com/h/hy/hysteresis.html)": the output depends not only on the current input but also on the previous state. Hysteresis is an important factor and its definition is paramount to understanding the theory, but the logical operation of the Schmitt Trigger explains the entire ranking threshold with hysteresis perfectly. Please look again at the Schmitt Trigger.
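
For readers who have not met a Schmitt trigger, here is a minimal software sketch of the switching logic. The class name, the idea of a single numeric "score" per site, and the two threshold values are hypothetical; the only point is that the level needed to switch on is higher than the level at which the output switches off again.

```python
class SchmittTrigger:
    """Two-threshold switch: turns on at rise_threshold, off at fall_threshold."""

    def __init__(self, rise_threshold, fall_threshold, listed=False):
        if fall_threshold >= rise_threshold:
            raise ValueError("fall_threshold must be below rise_threshold")
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.listed = listed

    def update(self, score):
        """Feed in the current score and return whether the site is listed."""
        if not self.listed and score >= self.rise_threshold:
            self.listed = True
        elif self.listed and score <= self.fall_threshold:
            self.listed = False
        # Between the two thresholds the previous state is kept (hysteresis).
        return self.listed


trigger = SchmittTrigger(rise_threshold=0.7, fall_threshold=0.4)
history = [trigger.update(s) for s in [0.2, 0.6, 0.8, 0.5, 0.45, 0.3, 0.6]]
```

Note that the score 0.6 gives "not listed" both times here, while 0.5 gives "listed": the answer depends on which side the site came from, which is the hysteresis.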


GLUTTONOUS GATHERING:
I also firmly believe that gathering links too quickly may extend the TTBL and increase the dampening effect - for obvious spammy reasons. This has already been seen with the "dream team" nigritude ultramarine site.

Hexed

Comments Please!

laiwa
May 22nd 2004, 1:03 am
Yes, there is definitely some algorithm that isn't linear. I can relate this to the adaptive regulation methods used in industrial engineering. There are very well developed and advanced systems for regulating industrial processes, and these could easily be applied to the ranking process too. This is of course a different thing from the actual "preliminary ranking" that is based on online and offline content. This secondary function (one of many) might be applied to regulate the speed of movement in the SERPs and the relative ranking of the sites. Age can be a factor here: if you look at a basic PID regulator function, it is based on the integral and the derivative of the control error as well as the control error itself. If there is a "regulating" function it will of course be a lot more advanced, but it should be possible to detect it with some experimentation.

Foxy
May 22nd 2004, 1:24 am
Hexed

That is nice thinking, and it fits in with what I have been seeing and have not yet analysed. For some "not too well" optimised pages that I have [yeah, I know you will all say every page, ha ha] I have been seeing them rise, fall out, come back, fall out and so on, and the only ones with stability are the ones that count in the factors that we have discussed elsewhere and are experimenting with. :)

I look forward to seeing what the others say.

laiwa
May 22nd 2004, 1:51 am
If they use a regulation function it should look something like the one below. It may be a bit hard to see right off how it would be implemented, but a polynomial function would at least do the dampening job. The more the site is "tweaked", the more it will be dampened (derivative); also, the longer it stays in one position, the more it takes to get it moved from there (integral). The function shown below is actually quite a simple polynomial that can do this. If they apply self-adaptive functionality, which nearly all modern regulators have, they can actually set the different variables for each keyword. That would explain the different results for "hard" and "easy" keywords. I don't think it's a threshold in itself; it's more that the regulating function adapts to the changes in the SERPs in some way. Knowing that Google wants to have algorithms doing the job, they must in some way be into this:
Simple polynomial (http://de.wikipedia.org/math/a0da3cee04ac20c0211ed081ceae2763.png)
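
As a rough illustration of the proportional, integral and derivative terms mentioned above, here is the standard textbook discrete PID regulator. It is not necessarily the formula in the linked image, and what the "error" would be in a ranking context (for example, the gap between where a page would rank and where it currently sits) is purely an assumption.

```python
class PIDRegulator:
    """Textbook discrete PID: output = Kp*e + Ki*sum(e*dt) + Kd*(de/dt)."""

    def __init__(self, kp, ki, kd, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.previous_error = None

    def update(self, error):
        """Take the current control error and return the regulator output."""
        self.integral += error * self.dt
        if self.previous_error is None:
            derivative = 0.0  # no history yet on the first sample
        else:
            derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


regulator = PIDRegulator(kp=1.0, ki=0.5, kd=0.1)
first = regulator.update(2.0)   # proportional 2.0 + integral 1.0 + derivative 0.0
second = regulator.update(1.0)  # proportional 1.0 + integral 1.5 + derivative -0.1
```

A "self-adaptive" version in the sense described above would simply pick different `kp`, `ki` and `kd` values per keyword.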

hexed
May 22nd 2004, 1:57 am
Thanks for your comments, both of you.

I'd just like to add that, in the Schmitt trigger example, the 5.0 V level, or Output, would represent whether the website is included in the SERPs or not. When your website variables (the input) reach a certain threshold (1.7 V), the switch is flipped and you instantly appear in the SERPs at the position where you crossed the threshold.

This explains why nobody ever really shows up at 9000 in the SERPs, but rather 100. The threshold would be set at 110 or so for that keyword combination, and until you pass the threshold, you will be filtered and not show up on the SERPs.

Do I sound insane?

Hexed

Foxy
May 22nd 2004, 2:31 am
Do I sound insane?


No, completely rational. Talking of which, where is he? Hehe

Thanks laiwa, I haven't looked at such a formula in a long time. This forum just gets better and better.

The thing we sometimes forget about search engines is that they apply mathematical formulae as determined [read applied] by the humans who designed them, which when applied in the total abstract can become an art form.

laiwa
May 22nd 2004, 3:17 am
It is quite easy, without any great computing power, for Google to apply a secondary "regulator filter" to the SERPs. Most likely it is some combination of a soft algorithm and one or more threshold functions. This also explains why some sites "hang around" for a while after they have been changed or deleted, even though they have been spidered. If Google has gone adaptive, they will weigh the different regulation and threshold variables depending on the competition for the keyword and how much movement there is in the SERPs for that specific keyword. This could mean that a site that, for example, gets lots of backlinks quickly then triggers the derivative part of the function. For a competitive keyword this could generate a major negative penalty, and the site goes into the clouds. With time the derivative goes down, of course, and the site goes up in the SERPs. This could look like a "threshold" but it is actually a derivative function. The polynomial could weigh both onsite and offsite factors in this model. The reason I keep speaking about the polynomial is that this is the type of function that Google likes, like the PageRank function. They don't need very many of these before they have really complicated life for us. The polynomial can also model the types of processes that exist in the SERPs quite well.

nohaber
May 22nd 2004, 6:21 am
Hexed,
let me guess. You've never really been a computer programmer. :eek:

Now, I'll give you my "SEO theorem".

If you've never been a computer programmer solving at least basic algorithmic problems. If you don't know stuff like algorithm complexity, graph algorithms, NP-completeness etc. etc. If you don't know math etc. etc. If you don't THINK like a programmer or software engineer, if you CAN'T THINK like the guys/gals at google, yahoo, etc. HOW DO YOU MAKE UP SERP THEORIES?

Let me tell you something. Some years ago I went to my first programming competition. I had a lot of programming experience. I had written so much code in so many different languages, including difficult stuff like assembler. I thought I knew programming. I got my a** kicked because the only algorithm that I knew was backtracking (trying all possible solutions to a problem, which takes much more time than is allowed at competitions).

Then I learned something. If you don't have experience with algorithms, you can't solve such tasks. In the following years I was coached by the professor who trained our national university and school programming teams, and I learned so much about algorithms that I even took one national first place and a lot of local first places. During this time I was drinking like a horse and wrote much less code than before. I was becoming a much lazier and untrained programmer. But I got to win lots of programming competitions even though I was such an alcoholic and out-of-shape programmer.

The take home message is: if you don't know algorithms, if you haven't solved such problems, if you haven't participated in programming contests where you solve problems efficiently to score points - YOU HAVE NO CLUE OF WHAT YOU ARE TALKING ABOUT. I am telling you, this thing is very specific. Even programmers with lots of years behind their backs can't do well when it comes to programming such tasks efficiently.

That's about 99% of the SEO experts out there. You have to know what's possible and what's feasible.

Thank you :cool:

hexed
May 22nd 2004, 8:04 am
Hexed,
let me guess. You've never really been a computer programmer. :eek:

Now, I'll give you my "SEO theorem".

If you've never been a computer programmer solving at least basic algorithmic problems. If you don't know stuff like algorithm complexity, graph algorithms, NP-completeness etc. etc. If you don't know math etc. etc. If you don't THINK like a programmer or software engineer, if you CAN'T THINK like the guys/gals at google, yahoo, etc. HOW DO YOU MAKE UP SERP THEORIES?

Thank you :cool:

That's really quite funny and I have no idea where you got that impression. Also, I don't know what side of the bed you woke up on, because even if I weren't a programmer, it's a pretty nasty statement to make to someone. I don't want to sound like I'm ego tripping here, but I think it's time to put you in your place.

I have programmed for over 20 years, probably before you knew what a computer was.

I am a computer engineer with a master's degree and I code in assembler, c, c++, java, php, perl, the list goes on. I also design autonomous robots from scratch and program them in micro assembler floating point code using a vector-driven based drive system with 360 degree motion.

I also teach assembler and robotics to university students part-time.

Just because you got your a** kicked in a competition because all you knew were linear O notation algorithms, don't take it out on people trying to do research and development. And if you do, don't go attacking people personally, because it's very frowned upon, especially when we're trying to assist you. But then again, this is probably why you're making calorie counters and fitness graphs instead of doing some real R&D and engineering.

I think it's time to bite your tongue. :rolleyes:

Hexed

PS - Sometimes I don't even know why I try to assist people like you.

nohaber
May 22nd 2004, 8:51 am
hexed,
you don't get the point. You may have 100 years of programming, but it does not help understanding algorithms. It's all specific. Someone who has developed 10000 php sites, can't solve a problem with dynamic optimization(just an example).

Now, I have a question. How would google implement this part of your theory?

"Google applies a dampening effect to new sites specific to the keyword or keyword combination. The amount of dampening and the length of time it takes to be even listed is proportional to the amount of sites that are already listed in the SERPs for those specific keywords."

Please, be specific. How would you code this into a search engine? :)

btw. if you live in my country, you'll never want to do R&D because there are better paid jobs ;)

compar
May 22nd 2004, 9:12 am
I've attacked a few people in forums in my day, but I have never seen such a villainous and unprovoked attack as the one by nohaber on hexed.

It was totally uncalled for and the worst thing of this type we have seen on this forum to date. Hexed, don't allow him to suck you into this any further. And nohaber, learn some manners.

Now back to the question at hand. How do the mathematicians who are attempting to speculate on what Google is doing with their algorithm explain the fact that you seem to be able to circumvent it with a search that includes a series of "-nonsense terms"?

Many of us in the McDar thread have tested our various keywords both with and without the -kfjks -ldsisdl -ljaffsdl -fjalk -kfaafs -laj -lafksjalj. In each case the sites we are actively working on --ie added links in the last few weeks and months -- rank much higher with the "-nonsense terms".

schlottke
May 22nd 2004, 9:25 am
nohaber-

It is after all a theory, yet a sound one. You know, at the most, as much as the rest of us. You have no place to rip into Hexed the way you did - you are uninformed. Google doesn't need to do it mathematically - they don't follow suit with all algorithm basics. They block results, and this by itself is against mathematics!

schlottke
May 22nd 2004, 9:33 am
"Google applies a dampening effect to new sites specific to the keyword or keyword combination. The amount of dampening and the length of time it takes to be even listed is proportional to the amount of sites that are already listed in the SERPs for those specific keywords."


Well, let's think on this for just one second. "Search Engine Optimization" requires thousands of backlinks and is highly competitive. BUYING 10,000 PR8 links would move you to #1. Say Google builds into their algo that a term has 10,000,000 sites, each of which has to reach a specific # set by them to be placed where it is in the results. Say 1,000,000 of the sites are optimized to a degree, making it difficult to rank high. Google makes this search phrase take longer in the sandbox.

Then go to a median term like "Football Helmets": these results are less optimized for and have fewer sites involving the term, so there are quicker moves to the top.

Now you go with a term like "nohaber". Nobody is optimizing for this at all. In 2 weeks' time your profile here (or on another forum) will jump to the #1 spot because it isn't optimized for and no sites are keying on that term.

It makes sense if you use your brain, it really does. I can't say for sure if it is true or not yet but if and when I show up #1 for all my terms in 1-2 months, I'll serve you a plate of crow.

nohaber
May 22nd 2004, 10:24 am
You still don't get the point, do you?

You type "keyword1 keyword2". What happens next? What data structures does google use? What's the complexity of implementing the theory? How much memory does it take?

Can it be run under 0.20 seconds on commodity PCs?

That's my question. Is it technically feasible? If yes, how??? Be specific.

rickbender1940
May 22nd 2004, 10:51 am
I've been doing some thinking and research on the entire sandbox effect, and here's a preliminary theory that just popped into my head. I also understand that there have been similar theories on the subject
...
g links too quickly may extend the TTBL and increase the dampening effect - for obvious spammy reasons. This has already been seen with the "dream team" nigritude ultramarine site.

Hexed

Comments Please!

Interesting theory. Have you run any tests to confirm parts of it? We'd be interested in hearing the results.

rickbender1940
May 22nd 2004, 10:52 am
Interesting theory. Have you run any tests to confirm parts of it? We'd be interested in hearing the results.

Actually, why don't you post in Webmasterworld. There's a lot of big-time SEO's in there, you'd get quick feedback if anybody has data supporting the theory. And post your Webmasterworld username so we can follow along

Foxy
May 22nd 2004, 10:54 am
nohaber you are out of order

There is a way and a method of querying and accepting the evidence or not - the way that you have come in here smacks of being on the drink - do you still?

If you have ever been to university - which I doubt - you would know that people are allowed to propose theories without being "put down". Consensus determines that it is accepted or not - not you as a person.

So I am so incensed by this attack, and in the interests of keeping this a friendly and technical forum [now how do I feel there seems to be a lot of aggro arriving recently as this forum becomes known as being interesting and free spirited?], I am going to tell you a bit about me which, as a private person, I don't normally do, but you nohaber should.

I am from an age where we used cards to load in information on IBM 360/30's in a language called Fortran 4 and PL1 was a new kid on the block and assembler was, well, assembler, and micros and the rest were not to appear until Apple in 1978.

I am a mathematician, and biochemist and rower, and sailor and skier and cook and...I didn't do the programming thing because I found it ...stultifying [enough to send you to drink - as you did].....but I found with the Internet in the 80s [were you born?] that there was something that might interest me, but it took until 1999 for me to use this medium.

Now when you [that is you, nohaber] have spent time considering the world a little more widely, as I and many others have, and considered that this theory of Hexed's does have merit - well thought through - and is backed up by practical examples from my sites as well as others [called consensus], then do try to behave as though you were trained by a University and not by a fish wife.

Foxy
May 22nd 2004, 11:03 am
Actually, why don't you post in Webmasterworld. There's a lot of big-time SEO's in there, you'd get quick feedback if anybody has data supporting the theory. And post your Webmasterworld username so we can follow along

Why doesn't he just stay here in the better forum and wait?

What "big-time SEO's in there" ? In Webmasterworld? Oh don't make me laugh!

Goodness me where did you come from?

Are you another worried about how good this forum is? :mad:

rickbender1940
May 22nd 2004, 11:12 am
Why doesn't he just stay here in the better forum and wait?

What "big-time SEO's in there" ? In Webmasterworld? Oh don't make me laugh!

Goodness me where did you come from?

Are you another worried about how good this forum is? :mad:

Not at all. I just want to see the biggest pool of data. Wherever the info comes from, is where I am. And there are some guys in there running BIG sites and collections of sites. Some of the affiliate guys are apparently making 5 figures. Mind you, WMW has an annoying policy of no URL's, no specific keywords etc. And a lot of nonsense posted about "Just make your content good, the ranking will come"!!

compar
May 22nd 2004, 11:17 am
Actually, why don't you post in Webmasterworld. There's a lot of big-time SEO's in there, you'd get quick feedback if anybody has data supporting the theory. And post your Webmasterworld username so we can follow along
There may be a "lot of big-time SEO's" on Webmasterworld, but there are a lot of damn good SEOs on this forum. I agree with Foxy, let the "big-time SEO's" come to us. They might learn something.

nohaber
May 22nd 2004, 11:24 am
foxyweb,
nice theory. But it can't run under 0.20 seconds.

I apologize if I appeared to be rude.

foxyweb,
I've participated in many programming contests during my high school and university years. In these contests you are assigned problems, and there are time limits for their solutions. If your program does not run in under, let's say, 1 second, you get no points. In these competitions you learn what's possible. I KNOW what can run in under 0.20 seconds. And hexed's theory is out of order. It would also require a monstrous amount of memory.

That means that even if google wants to do what hexed has written, it can't implement it so that it runs under 0.20 seconds. Did I make myself clear now?

compar
May 22nd 2004, 11:36 am
That means that even if google wants to do what hexed has written, it can't implement it so that it runs under 0.20 seconds.
I tend to agree with you on this point. In fact it is this very point I believe that stops Google from implementing a great many of the tools or techniques that people think they are using such as hilltop and local rank etc.

But you have to remember that Google is purported to have the largest computer network in the world, so they may be able to do things we would normally believe impossible. I'm not a programmer, but would it not be possible with all their computing power and servers to precalculate a great deal of ranking information?

Certainly with popular keyword terms this would seem to be possible. This may even explain why the "-nonsense terms" search works. They haven't got anything precalculated for that and have to give you back the raw results for the keyword?

nohaber
May 22nd 2004, 11:55 am
compar,
you are getting what I am trying to say. It is my fault, that I didn't say it as it should have been. The point is most SEOs try to observe, make theories, but in the end, they don't know what's possible and technically feasible.

LocalRank is feasible and can be implemented by the way. It can run well under 0.20 seconds.

Hilltop is not used. LocalRank patents extend the Hilltop idea, but Hilltop is different from LocalRank (Hilltop looks for expert documents by its own criteria, while LocalRank assumes that the first ranking stage (PageRank + IR score) has ordered the documents by "expertness"). If it weren't for LocalRank, dmoz would be #1 for many queries. Basically, dmoz pushes all the sites it lists up by LocalRank.

Re: about the computing power. Well, google might have 10000000 computers, but they help only if two tasks can be done in parallel (on 2 computers). If a part of the algorithm needs to wait for another part to finish, it all comes down to how fast is the computer that runs the task.
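
The point about parallel versus sequential work is usually quantified with Amdahl's law, which is swapped in here as an illustration; the fraction of Google's query processing that can run in parallel is unknown, so the 0.9 in the example is just an assumed number.

```python
def amdahl_speedup(parallel_fraction, machines):
    """Amdahl's law: best-case speedup when only part of a task parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if machines < 1:
        raise ValueError("machines must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / machines)


# Assumed example: 90% of the work can be spread across machines.
speedup_on_ten = amdahl_speedup(0.9, 10)
speedup_on_ten_million = amdahl_speedup(0.9, 10_000_000)
```

Under that assumption ten machines give a speedup of a little over 5, and ten million machines still give less than 10, because the sequential 10% has to run on one computer.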

The "-nonsense words" trick does not support hexed's theory or any other theory. I'd say it is a peculiarity of google's algo. The "-.." probably makes google skip a step in the algo. After a while, they'll fix this (just as happened with the Florida update). Plus, almost no one searches with "-nonsense terms" :)

Foxy
May 22nd 2004, 12:10 pm
foxyweb,
nice theory. But it can't run under 0.20 seconds.

I apologize if I appeared to be rude.

foxyweb,
I've participated in many programming contests during my high school and university years. In these contests you are assigned problems, and there are time limits for their solutions. If your program does not run in under, let's say, 1 second, you get no points. In these competitions you learn what's possible. I KNOW what can run in under 0.20 seconds. And hexed's theory is out of order. It would also require a monstrous amount of memory.

That means that even if google wants to do what hexed has written, it can't implement it so that it runs under 0.20 seconds. Did I make myself clear now?

Say that to Hexed

and we are all back in business and welcome to the best forum :)

dazzlindonna
May 22nd 2004, 1:05 pm
Every time I read this thread, I get an image of a disease based on mass hysteria. And when Google makes major algo changes, I think that image isn't too far off for many webmasters. Ok, sorry, I know this doesn't contribute anything to this thread, but the whole concept is way above my head. Sounds feasible to me, though, but then again, so does the probability of a hysterical-disease-ridden algo. :) Carry on.

schlottke
May 22nd 2004, 1:22 pm
On March 31st it wasn't feasible for a free email provider to offer a gig of space either. Weird.

Also, I don't think it is a disease ridden algo - I think if they are using these methods it is better, because they can bring down the manipulation of terms. For instance, people that once bought links for the Christmas season on "christmas gifts" would now need to waste 3 times the money on those links before they'd reach the results. It wouldn't take any further load time: Google would find the results and rank them the same way, adding additional criteria as well. It could still be done quickly - no question. 4 years ago you would have thought stemming and 4 billion pages were off the wall.

Foxy
May 22nd 2004, 1:42 pm
it all comes down to how fast is the computer that runs the task.


Isn't it Illinois who runs 1000 G5s [macs to you who don't know] to be the 2nd fastest computer in the world?

That's just a bundle of PCs, isn't it?

Now if I remember correctly the algo is applied to the page at the "bot" stage is it not?

So the retrieval does not require the algo to run does it not?

So the speed is not based on the need to run the algo does it not?

Correct me if I am wrong

dazzlindonna
May 22nd 2004, 1:58 pm
I don't really think it is a disease-ridden algo. That was just my attempt at humor.

schlottke
May 22nd 2004, 2:10 pm
Ha, sorry about that didn't really read it well I guess

compar
May 22nd 2004, 2:10 pm
The "-.." probably makes google skip a step in the algo. After a while, they'll fix this (just as happened with the Florida update). Plus, almost no one searches with "-nonsense terms" :)
I wasn't suggesting that it supported any theory. Except my own suggestion of the possible use of precalculated dampening.

And of course nobody searches on "-nonsense terms". It is hardly necessary to state the obvious. The use of the "-nonsense terms" is for SEOs or webmasters trying to predict the position of their page when it gets by the dampening period.

nohaber
May 22nd 2004, 2:26 pm
First,
I'd like to apologize to hexed. I realize I was rude. Hexed obviously wants to help people, and he didn't deserve my comments.

foxyweb,
google runs on thousands of commodity PCs. Read "The Anatomy of a Large-Scale Hypertextual Web Search Engine" paper by Sergey Brin and Lawrence Page and you'll learn much more than reading SEO articles.

"Now if I remember correctly the algo is applied to the page at the "bot" stage is it not?" NO. It would mean that the algo should be applied to all possible queries in whose ranking a page can participate + it would require that google stores ranking info about all possible queries which would take sooo much memory.

"So the retrieval does not require the algo to run does it not?
So the speed is not based on the need to run the algo does it not?"
I can't understand this.

Basically when you type a query "keyword1 keyword2" google uses the inverted index which list all documents which contain a keyword or anchor text of a link pointing to a document. So google gets the list of all documents that contain "keyword1" and merges it with the list of documents containing "keyword2" to find the intersection (it is a list of numbers identifying documents). These lists are presorted by docID. They are presorted because this allows very fast merging of the lists. Then google scans the so called hit lists, calculates an IR (info retrieval) score, combines it with PageRank, and finally reranks the top 1000 using LocalRank. Of course, that's oversimplified. It is very well explained in the above paper. There are lots of changes. The original google for example put equal weight to all anchor text hits (it didn't matter if the link was coming from PR 0 or PR4 page). If you read the paper, you'll realize that google never used "keyword density" as many "experts" falsely believe. That's why I am getting mad at times at SEOs :)
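
The merge step described above can be sketched as a two-pointer intersection of posting lists that are already sorted by docID. The tiny index and its docIDs are made up for illustration; the real structures in the paper (barrels, hit lists, lexicon) are far more involved.

```python
def intersect_sorted(postings_a, postings_b):
    """Return the docIDs present in both sorted lists, in one linear pass."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return result


# Hypothetical inverted index: keyword -> docIDs, presorted by docID.
inverted_index = {
    "keyword1": [1, 3, 5, 8, 13],
    "keyword2": [2, 3, 8, 9, 13, 21],
}

candidates = intersect_sorted(inverted_index["keyword1"],
                              inverted_index["keyword2"])
```

Because both lists are presorted, the pass touches each entry at most once, which is why the presorting matters for speed.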

At the crawling stage a lot of other things happen, such as detecting duplicate documents, affiliated hosts etc.etc.

It's all in the papers and patents.

Foxy
May 23rd 2004, 2:08 am
First,
I'd like to apologize to hexed. I realize I was rude. Hexed obviously wants to help people, and he didn't deserve my comments.


Well said - the sign of a true gentleman.


foxyweb,
google runs on thousands of commodity PCs. Read "The Anatomy of a Large-Scale Hypertextual Web Search Engine" paper by Sergey Brin and Lawrence Page and you'll learn much more than reading SEO articles.


I have actually - and I don't read SEO articles endlessly but I do try to find which way Google is going today and what weight is put on each factor, and, I have accepted this forum as a "free" thinking forum that experiments with ideas.


"Now if I remember correctly the algo is applied to the page at the "bot" stage is it not?" NO. It would mean that the algo should be applied to all possible queries in whose ranking a page can participate + it would require that google stores ranking info about all possible queries which would take sooo much memory.


In this I believe you are incorrect so I went back to the paper and pulled the pieces that are relevant

to wit: 1.3.2......Google stores all of the actual documents it crawls in compressed form.

and 2.1.1:
....PageRank for 26 million web pages can be computed in a few hours on a medium size workstation.... which means, as we know, that it is precalculated as are all factors and placed in a compressed form for retrieval.

and 3/4:

..The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

4.2 Major Data Structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although, CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.....

Now from above this phrase "The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries."

This is NOT the algo; this is the retrieval, and it is where we are not agreeing, as this "search" relies on minimal disc usage [one per search]. Therefore [certainly with today's CPU speeds being much higher than when this paper was written] Google is able to perform greater "filtering" at the front end without any loss of speed. Unfortunately I don't have on record any retrieval figure comparisons from 10 years ago to be able to state this categorically, but common sense dictates that this would be the case.


"So the retrieval does not require the algo to run does it not?
So the speed is not based on the need to run the algo does it not?"
I can't understand this.

Basically when you type a query "keyword1 keyword2" google uses the inverted index which list all documents which contain a keyword or anchor text of a link pointing to a document. So google gets the list of all documents that contain "keyword1" and merges it with the list of documents containing "keyword2" to find the intersection (it is a list of numbers identifying documents). These lists are presorted by docID. They are presorted because this allows very fast merging of the lists. Then google scans the so called hit lists, calculates an IR (info retrieval) score, combines it with PageRank, and finally reranks the top 1000 using LocalRank. Of course, that's oversimplified. It is very well explained in the above paper. There are lots of changes. The original google for example put equal weight to all anchor text hits (it didn't matter if the link was coming from PR 0 or PR4 page). If you read the paper, you'll realize that google never used "keyword density" as many "experts" falsely believe. That's why I am getting mad at times at SEOs :)

At the crawling stage a lot of other things happen, such as detecting duplicate documents, affiliated hosts, etc.

It's all in the papers and patents.

Which should now be explained - I think :)

nohaber
May 23rd 2004, 3:23 am
foxyweb,
surely, PageRank is precalculated, but PageRank is a query-independent factor. So you have PageRank before you have the query keywords ;)
But the actual ranking is a combination of PageRank and the IR (info retrieval score) or matching between the keywords and the hit lists. The IR score is not precalculated, because that would mean calculating and storing info about every possible query.
Basically google needs to read the inverted index to get the list of documents and the hit lists in one read. Merge the lists to get the candidate docs, and process the hit lists, and combine IR with PageRank. Now the web has grown, and google uses more than one read operation. But it's done on different PCs.

But you are wrong. Google does not store query information, such as do this for this query. It's unfeasible to patch specific queries. It's much more productive to concentrate on the ranking algo. Plus, it takes a lot of space to store info about every query.

compar
May 23rd 2004, 7:52 am
IR (info retrieval score)
I've seen you mention this IR factor a number of times now and I have no idea what you are referring to. Can you explain this please?

nohaber
May 23rd 2004, 8:12 am
IR means Information Retrieval. It's simply matching documents to keywords. Basically, you look for instances of the search keywords in the text and the more you find, and the closer they are to each other (for multiword queries), the higher the IR score.
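
A toy Python sketch of that idea follows. The scoring formula (occurrence count plus a closeness bonus for neighbouring query words) is an assumption made up for illustration, not a published formula, and the tokenizer is a plain whitespace split that ignores punctuation.

```python
def ir_score(text, query):
    """Toy IR score: more keyword hits and closer keywords score higher."""
    words = text.lower().split()
    terms = query.lower().split()
    # Positions of every query term in the document.
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    if any(not p for p in positions.values()):
        return 0.0  # a query word is missing, so the document does not match
    count_score = sum(len(p) for p in positions.values())
    proximity = 0.0
    for a, b in zip(terms, terms[1:]):
        # Smallest distance between any occurrence of a and any occurrence of b.
        gap = min(abs(i - j) for i in positions[a] for j in positions[b])
        proximity += 1.0 / max(gap, 1)
    return count_score + proximity


close = ir_score("cheap flights to paris", "cheap flights")
far = ir_score("cheap hotels and some flights", "cheap flights")
print(close, far)
```

Both sample documents contain each keyword once, so only the distance between the keywords separates their scores.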

As google put it, the IR score guarantees specificity to the query and PageRank guarantees quality. The IR score is easy to bump up by just repeating the keywords, but PageRank is another story :) When you combine the two, you get the best search engine in the world.
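
How the two scores are combined is not spelled out in this thread, so the following Python sketch simply assumes a weighted geometric mean. The function, the weight and the sample numbers are all hypothetical; the sketch only shows why repeating keywords alone does not win once a quality score is mixed in.

```python
def combined_score(ir, pagerank, ir_weight=0.5):
    """Assumed combination (weighted geometric mean), not a published formula."""
    return (ir ** ir_weight) * (pagerank ** (1 - ir_weight))


# Hypothetical documents: one stuffed with keywords, one with good links.
docs = {
    "stuffed_page": {"ir": 9.0, "pagerank": 0.1},
    "quality_page": {"ir": 4.0, "pagerank": 4.0},
}

ranked = sorted(docs, key=lambda d: combined_score(**docs[d]), reverse=True)
print(ranked)
```

With these made-up numbers the page with the higher IR score still ranks second, because its PageRank drags the combined score down.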

Voyager
May 23rd 2004, 9:05 pm
"If Google has gone adaptive, they will weigh the different regulation and threshold variables depending on the competition for the keyword and how much movements there is in the serps for that specific keyword."

Laiwa:

My objection to this is that keywords are created dynamically.

Webmasters think in terms of a finite set of keywords, as in "these are my three keywords". Users invent new key phrases constantly -- and it really is all about key phrases, not key words.

One user found my web page with the key phrase "phone number alberta live saskatchewan calling".

I do not believe that the behavior of the Google algorithm varies depending upon the keywords put through it.

However, one caveat, the behavior of the Google algorithm does demonstrably vary based upon the number of words in a phrase.

Foxy
May 23rd 2004, 11:14 pm
I went back and got the quote below because I didn't say this

"But you are wrong. Google does not store query information, such as do this for this query. It's unfeasible to patch specific queries. It's much more productive to concentrate on the ranking algo. Plus, it takes a lot of space to store info about every query."

But it is relevant to what we are discussing

foxyweb,
nice theory. But it can't run under 0.20 seconds.

I apologize if I appeared to be rude.

foxyweb,
I've participated in many programming contests during my high school and university years. In these contests, you are assigned problems, and there are time limits for their solutions. If your program does not run in under, let's say, 1 second, you get no points. In these competitions you learn to know what's possible. I KNOW what can run under 0.20 seconds. And hexed's theory is out of order. It would also require a monstrous amount of memory.

That means that even if google wants to do what hexed has written, it can't implement it so that it runs under 0.20 seconds. Did I make myself clear now?

You see, Hexed said that this hysteresis factor was applied to the keywords/keyword phrases, and it occurred to me that, as you were being very determined about the speed of the retrieval, there may have been a basic misunderstanding as to where this factor was applied.

Hexed, as I understand it, meant that Google applies it to the keywords/keyword phrases on your own pages at the "first level" in Google, the gestation period. That is, it is part of the determination Google makes in its algorithms as to how the page might perform, ahead of the time when a searcher does a search on Google and initiates the retrieval sequence, which collects the PR and IR and whatever else to produce the page results.
:)

Chatmaster
May 24th 2004, 12:14 am
This is one of the most useful posts I have ever read. Nohaber and Foxyweb, well done with this civilised discussion. Hexed, I believe your theory to be correct or very much possible. First I want to state that I am no match for any of you when it comes to programming experience, as all my programming experience is self-taught and only 7 years. But what makes sense to me is that with the spidering of a webpage a few things need to happen before the page is accepted in the Google SERPs. The page needs to be "cleared" in a sense: checked for spam, IP, duplicate content etc. I believe this is where the sandbox effect takes place. Thereafter it has to go through a certain process that will release it from the sandbox and let it be displayed together with the algo results. The actual algo will then mainly serve as a query, relevant to the user's request, over the allowed sites, thus making it possible to run in under 0.2 secs.

nohaber
May 24th 2004, 5:03 am
chatmaster,
I am telling you, (and google has many times said it) - There's nothing specifically done to patch certain queries (as hexed proposes).
There might be different reasons for the sandbox effect some sites experience. I like the idea of weighting a link by its age. Like when google discovers a new interhost link, it gives it let's say 25% weight and waits some time before giving 100%. It will prevent SERP spikes when sites get listed in "what's new pages" etc. Other things to note: PageRank is recalculated periodically. It's a time consuming process. If PageRank is recalculated once in 3 months, it may slow down SERP changes.
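
A minimal Python sketch of the link-age idea: the 25% starting weight comes from the paragraph above, while the linear ramp and the 90-day maturity period are assumptions chosen only for illustration.

```python
def link_weight(age_days, initial=0.25, maturity_days=90):
    """Weight of an inter-host link as a function of its age.

    Starts at `initial` when the link is discovered and grows linearly to 1.0
    after `maturity_days`. The ramp shape and the period are assumptions.
    """
    if age_days <= 0:
        return initial
    if age_days >= maturity_days:
        return 1.0
    return initial + (1.0 - initial) * age_days / maturity_days


for age in (0, 45, 90, 365):
    print(age, link_weight(age))
```

Under this sketch a burst of brand-new links contributes only a quarter of its eventual value at first, which would smooth out the kind of SERP spikes mentioned above.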

Google might periodically reevaluate interhost links by affiliation/relatedness. For example: when two hosts have very similar outgoing links, they look like mirrors or affiliated hosts. When you get links from affiliated hosts, google might ignore all except the highest-ranked link (for example: the links from all dmoz clones might be ignored).
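
The affiliation idea can be sketched in Python as follows. Jaccard similarity of outgoing-link sets, the 0.7 threshold, the host names and the rank values are all assumptions for illustration; the thread does not say how similarity would actually be measured.

```python
def jaccard(a, b):
    """Similarity of two link sets: shared links divided by all links."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_affiliated_links(inlinks, outlinks, threshold=0.7):
    """Keep only the highest-ranked link from each group of affiliated hosts.

    inlinks:  list of (host, rank) pairs linking to the target page.
    outlinks: dict mapping host -> set of URLs that host links out to.
    """
    kept = []
    # Visit the strongest links first so each group is represented by its best.
    for host, rank in sorted(inlinks, key=lambda x: x[1], reverse=True):
        if all(jaccard(outlinks[host], outlinks[k]) < threshold for k, _ in kept):
            kept.append((host, rank))
    return kept


# Hypothetical hosts: a directory, two clones of it, and an unrelated blog.
outlinks = {
    "dmoz.example": {"a", "b", "c", "d"},
    "clone1.example": {"a", "b", "c", "d"},
    "clone2.example": {"a", "b", "c", "d", "e"},
    "blog.example": {"x", "y"},
}
inlinks = [
    ("clone1.example", 3),
    ("dmoz.example", 7),
    ("blog.example", 4),
    ("clone2.example", 2),
]

print(filter_affiliated_links(inlinks, outlinks))
```

In this toy data the two clones are dropped because their outgoing links nearly match the directory's, so only the directory link and the blog link would count.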

Another thing is cocitation. When two sites get links from a set of other sites, these two sites are related.
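
A small Python sketch of counting cocitation: for every pair of sites, count how many pages link to both. The page and site names are hypothetical, and this is only the basic counting idea, not any particular search engine's implementation.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(outlinks):
    """Count, for each pair of sites, how many pages link to both of them."""
    counts = Counter()
    for targets in outlinks.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical linking pages and the sites they point to.
page_links = {
    "p1": ["siteA", "siteB"],
    "p2": ["siteA", "siteB", "siteC"],
    "p3": ["siteB", "siteC"],
}

counts = cocitation_counts(page_links)
print(counts.most_common())
```

Pairs with a high count are linked together often, which is the sense in which two sites would be treated as related.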

There are lots of possible things. I am saying that google does not do special things for specific queries. I think google evolves in the way that it looks at inter-host links.

Check out this paper from Krishna Bharat and two more google employees:
"Who Links to Whom: Mining Linkage between Web Sites"
download from here: http://searchwell.com/krishna/publications.html

schlottke
May 24th 2004, 7:50 am
"I am telling you, (and google has many times said it) - There's nothing specifically done to patch certain queries (as hexed proposes).
There might be different reasons for the sandbox effect some sites experience. I like the idea of weighting a link by its age. Like when google discovers a new interhost link, it gives it let's say 25% weight and waits some time before giving 100%. It will prevent SERP spikes when sites get listed in "what's new pages" etc. Other things to note: PageRank is recalculated periodically. It's a time consuming process. If PageRank is recalculated once in 3 months, it may slow down SERP changes."

You totally contradicted yourself. It is obvious google doesn't actually hand-pick queries and alter them; that isn't what Hexed is saying. Hexed is saying that all sites receive different "sandbox" lengths based on the age/optimization of the sites involved in the search. It would stand to reason that these factors (AGE AND OPTIMIZATION) would go hand in hand. The older the sites are, the more backlinks they would naturally have (without buying them, obviously), thus the reason for the "sandbox effect", if it did exist.

nohaber
May 24th 2004, 9:16 am
schlottke,
I don't contradict myself. According to hexed, a page will not rank for a considerable time for competitive keywords, but will rank sooner for non-competitive keywords. This is KEYWORD-SPECIFIC. If google is to implement hexed's theory, it has to store info about every possible query (and there are many). Google will have to know, for every page in the index, the first time it appeared in the SERPs, etc. This is storing info about specific keywords.

What I have written is LINK-SPECIFIC. It won't matter if you want your page ranked for competitive or non-competitive terms. I am talking about the value of a LINK between different hosts. I am talking about the time when Google updates its inverted index, which includes anchor hits, and the time when it recalculates PageRank. So, google may show you some PageRank, but internally it may use a totally different, modified PR. Also, the anchor hits may be given different values. What I am saying has nothing to do with queries. Although, let's say, google may devalue new links, you can still get a top ranking for a new page if you have a huge PR. What I imply does not require more memory. It is simply another step in the algo, which is done "in-house" before the index update. It won't affect the response time or the memory requirements. What I imply just changes the data in the index, because the inverted index ALREADY has the capability of classifying the value of links.

You see, I make a hypothesis that tries to explain the same phenomenon that hexed and I see. But mine is technically feasible. That's the difference.

Chatmaster
May 26th 2004, 1:56 am
schlottke,
I don't contradict myself. According to hexed, a page will not rank for a
You just did! :cool:

nohaber
May 26th 2004, 5:36 am
How did I contradict myself, Chatmaster?

payoutwindow
Mar 16th 2005, 6:48 am
raising the dead .. very nice thread .. noobs should enjoy.

inthedark
Mar 16th 2005, 8:06 am
so what exactly is the moral of the story here?

that the sandbox is a function of Google only updating page rank once per quarter vs monthly plus possibly a link dampening factor?

Chrissicom
Mar 16th 2005, 11:34 pm
I also firmly believe that gathering links too quickly may extend the TTBL and increase the dampening effect - for obvious spammy reasons. This has already been seen with the "dream team" nigritude ultramarine site.

I think this is the most stupid thing I have read in the whole article. Sorry if I sound confusing: I don't think that this is wrong, I think that Google is doing something really stupid!

This means changing your domain will most likely put you in the sandbox, even when you move an indexed GeoCities site to a TLD, which is not that stupid a thing to do. It also means that Google "doesn't allow" you to advertise your website as long as it's new and they don't list it. As a webmaster I would totally ignore that and not care about Google rankings; I get enough SE traffic from Yahoo and MSN. Since most traffic comes from website links and link exchanges with similar websites, though, I just get as many links as possible, also for a new website, because that delivers a lot more traffic than a lousy good Google ranking that's much harder to get than a good link exchange.

Starfighter
Dec 1st 2007, 6:25 am
Now, I do not know how to explain this. For years operating systems have been treated by me as a child. When they do not behave properly, I make adjustments.

Has anyone out there had their sites come up 27 times in a row on the first three pages of Google?

Let's talk not about getting 1 Domain to the top of the list, but about getting 100 Domains to the top of all lists.

Now why sell one Web Site, when you can sell a Hundred at one time?

Any questions? One more thing. Integrity and Honor are virtues that I employ. Anything that I own or possess I consider a gift; if I have a place to live, food in the fridge and clothes on my back, I am content.

So the proposed question is: does anyone out there have experience in Multiple Domain Name Rankings?

Starfighter :eek: