Google "Sandbox" Threshold with "Hysteresis"

Discussion in 'Search Engine Optimization' started by hexed, May 21, 2004.

  1. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #21
    foxyweb,
    nice theory. But it can't run under 0.20 seconds.

    I apologize if I appeared to be rude.

    foxyweb,
    I've participated in many programming contests during my high school and university years. In those contests you are assigned problems, and there are time limits for the solutions. If your program does not run in under, let's say, 1 second, you get no points. In those competitions you learn what's possible. I KNOW what can run under 0.20 seconds, and hexed's theory is out of the question. It would also require a monstrous amount of memory.

    That means that even if Google wanted to do what hexed has written, it couldn't implement it so that it runs under 0.20 seconds. Did I make myself clear now?
     
    nohaber, May 22, 2004 IP
  2. compar

    compar Peon

    Messages:
    2,705
    Likes Received:
    169
    Best Answers:
    0
    Trophy Points:
    0
    #22
    I tend to agree with you on this point. In fact, I believe it is this very point that stops Google from implementing a great many of the tools or techniques people think they are using, such as Hilltop, LocalRank, etc.

    But you have to remember that Google is purported to have the largest computer network in the world, so they may be able to do things we would normally believe impossible. I'm not a programmer, but would it not be possible with all their computing power and servers to precalculate a great deal of ranking information?

    Certainly with popular keyword terms this would seem to be possible. This may even explain why the "-nonsense terms" search works: they haven't got anything precalculated for that and have to give you back the raw results for the keyword?
     
    compar, May 22, 2004 IP
  3. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #23
    compar,
    you are getting what I am trying to say. It is my fault that I didn't say it as it should have been said. The point is that most SEOs try to observe and make theories, but in the end they don't know what's possible and technically feasible.

    LocalRank, by the way, is feasible and can be implemented. It can run well under 0.20 seconds.

    Hilltop is not used. The LocalRank patents extend the Hilltop idea, but Hilltop is different from LocalRank (Hilltop looks for expert documents by its own criteria, while LocalRank assumes that the first ranking stage (PageRank + IR score) has already ordered the documents by "expertness"). If it weren't for LocalRank, dmoz would be #1 for many queries. Basically, dmoz pushes up all the sites it lists via LocalRank.

    Re: the computing power. Well, Google might have 10,000,000 computers, but they help only if two tasks can be done in parallel (on 2 computers). If one part of the algorithm has to wait for another part to finish, it all comes down to how fast the computer that runs that task is.
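
    A quick back-of-the-envelope illustration of that point (this is just Amdahl's law with made-up numbers, not anything from Google):

    ```python
    # Amdahl's law: the speedup from N machines when a fraction `serial_fraction`
    # of the work cannot be parallelized. All numbers here are illustrative.
    def speedup(n_machines: int, serial_fraction: float) -> float:
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_machines)

    # Even with 10,000 machines, a job that is 10% serial
    # can never run more than 10x faster than on one machine.
    print(speedup(10_000, 0.10))  # ~9.99
    ```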

    The "-nonsense words" does not support hexed theory or any other theory. I'd say it is a peculiarity of google's algo. The "-.." probably makes google skip a step in the algo. After a while, they'll fix this (just as happened with the Florida update). Plus, almost noone searches with "-nonsense terms" :)
     
    nohaber, May 22, 2004 IP
  4. Foxy

    Foxy Chief Natural Foodie

    Messages:
    1,614
    Likes Received:
    48
    Best Answers:
    0
    Trophy Points:
    0
    #24
    Say that to Hexed

    and we are all back in business and welcome to the best forum :)
     
    Foxy, May 22, 2004 IP
  5. dazzlindonna

    dazzlindonna Peon

    Messages:
    553
    Likes Received:
    21
    Best Answers:
    0
    Trophy Points:
    0
    #25
    Every time I read this thread, I get an image of a disease based on mass hysteria. And when Google makes major algo changes, I think that image isn't too far off for many webmasters. Ok, sorry, I know this doesn't contribute anything to this thread, but the whole concept is way above my head. Sounds feasible to me, though, but then again, so does the probability of a hysterical-disease-ridden algo. :) Carry on.
     
    dazzlindonna, May 22, 2004 IP
  6. schlottke

    schlottke Peon

    Messages:
    2,185
    Likes Received:
    63
    Best Answers:
    0
    Trophy Points:
    0
    #26
    On March 31st it wasn't feasible for a free email provider to offer a gig of space either. Weird.

    Also, I don't think it is a disease-ridden algo. I think if they are using these methods, it's better, because they can cut down on the manipulation of terms. For instance, people who once bought links for the Christmas season on "christmas gifts" would now need to waste three times the money on those links before they'd reach the results. It wouldn't take any further load time; Google would find the results and rank them the same way, adding additional criteria as well. It could still be done quickly, no question. Four years ago you would have thought stemming and 4 billion pages were off the wall.
     
    schlottke, May 22, 2004 IP
  7. Foxy

    Foxy Chief Natural Foodie

    Messages:
    1,614
    Likes Received:
    48
    Best Answers:
    0
    Trophy Points:
    0
    #27
    Isn't it Illinois who runs 1000 G5s [Macs, to those who don't know] as the 2nd fastest computer in the world?

    That's just a bundle of PCs, isn't it?

    Now, if I remember correctly, the algo is applied to the page at the "bot" stage, is it not?

    So the retrieval does not require the algo to run, does it?

    So the speed is not based on the need to run the algo, is it?

    Correct me if I am wrong.
     
    Foxy, May 22, 2004 IP
  8. dazzlindonna

    dazzlindonna Peon

    Messages:
    553
    Likes Received:
    21
    Best Answers:
    0
    Trophy Points:
    0
    #28
    I don't really think it is a disease-ridden algo. That was just my attempt at humor.
     
    dazzlindonna, May 22, 2004 IP
  9. schlottke

    schlottke Peon

    Messages:
    2,185
    Likes Received:
    63
    Best Answers:
    0
    Trophy Points:
    0
    #29
    Ha, sorry about that, I didn't really read it well, I guess.
     
    schlottke, May 22, 2004 IP
  10. compar

    compar Peon

    Messages:
    2,705
    Likes Received:
    169
    Best Answers:
    0
    Trophy Points:
    0
    #30
    I wasn't suggesting that it supported any theory, except my own suggestion of the possible use of precalculated dampening.

    And of course nobody searches on "-nonsense terms". It is hardly necessary to state the obvious. The "-nonsense terms" trick is for SEOs or webmasters trying to predict the position of their page once it gets past the dampening period.
     
    compar, May 22, 2004 IP
  11. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #31
    First,
    I'd like to apologize to hexed. I realize I was rude. Hexed obviously wants to help people, and he didn't deserve my comments.

    foxyweb,
    Google runs on thousands of commodity PCs. Read the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page and you'll learn much more than from reading SEO articles.

    "Now if I remember correctly the algo is applied to the page at the "bot" stage is it not?" NO. It would mean that the algo should be applied to all possible queries in whose ranking a page can participate + it would require that google stores ranking info about all possible queries which would take sooo much memory.

    "So the retrieval does not require the algo to run does it not?
    So the speed is not based on the need to run the algo does it not?"
    I can't understand this.

    Basically, when you type a query "keyword1 keyword2", Google uses the inverted index, which lists, for each keyword, all the documents that contain it (in their text or in the anchor text of links pointing to them). So Google gets the list of all documents that contain "keyword1" and merges it with the list of documents containing "keyword2" to find the intersection (each list is just a list of numbers identifying documents). These lists are presorted by docID, because that allows very fast merging. Then Google scans the so-called hit lists, calculates an IR (information retrieval) score, combines it with PageRank, and finally reranks the top 1000 using LocalRank. Of course, that's oversimplified; it is very well explained in the above paper.

    There have been lots of changes since then. The original Google, for example, put equal weight on all anchor-text hits (it didn't matter whether the link came from a PR0 or a PR4 page). If you read the paper, you'll realize that Google never used "keyword density" as many "experts" falsely believe. That's why I get mad at SEOs at times :)
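
    For anyone curious, here is a rough Python sketch of that merge step, with invented docIDs and no real scoring; it only shows why presorting by docID makes the intersection fast:

    ```python
    # Toy inverted index: keyword -> posting list of docIDs, presorted ascending.
    # The docIDs and keywords are invented for illustration.
    index = {
        "keyword1": [3, 7, 12, 31, 58, 90],
        "keyword2": [7, 19, 31, 44, 90],
    }

    def intersect(a, b):
        """Merge two docID lists sorted ascending: linear time, no random seeks."""
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    candidates = intersect(index["keyword1"], index["keyword2"])
    print(candidates)  # [7, 31, 90] -> these docs then get IR + PageRank scoring
    ```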

    At the crawling stage a lot of other things happen, such as detecting duplicate documents, affiliated hosts, etc.

    It's all in the papers and patents.
     
    nohaber, May 22, 2004 IP
  12. Foxy

    Foxy Chief Natural Foodie

    Messages:
    1,614
    Likes Received:
    48
    Best Answers:
    0
    Trophy Points:
    0
    #32
    Well said - the sign of a true gentleman.

    I have, actually. I don't read SEO articles endlessly, but I do try to find out which way Google is going today and what weight is put on each factor, and I have accepted this forum as a "free-thinking" forum that experiments with ideas.

    On this I believe you are incorrect, so I went back to the paper and pulled out the relevant pieces.

    to wit: 1.3.2......Google stores all of the actual documents it crawls in compressed form.

    and 2.1.1:
    ....PageRank for 26 million web pages can be computed in a few hours on a medium size workstation.... which means, as we know, that it is precalculated, as are all factors, and placed in a compressed form for retrieval.

    and 3/4:

    ..The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

    4.2 Major Data Structures
    Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although, CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.....

    Now, note this phrase from above: "The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries."

    This is NOT the algo; this is the retrieval, and it is where we are not agreeing. This "search" relies on minimal disk usage [one per search], and therefore [certainly with today's CPU speeds being much higher than when this paper was written] Google is able to perform greater "filtering" at the front end without any loss of speed. Unfortunately I don't have on record any retrieval-figure comparisons from 10 years ago to be able to state it categorically, but common sense dictates that this would be the case.

    Which should now be explained - I think :)
     
    Foxy, May 23, 2004 IP
  13. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #33
    foxyweb,
    Surely, PageRank is precalculated, but PageRank is a query-independent factor. So you have PageRank before you have the query keywords ;)
    But the actual ranking is a combination of PageRank and the IR (information retrieval) score, i.e. the matching between the keywords and the hit lists. The IR score is not precalculated, because that would mean calculating and storing info about every possible query.
    Basically, Google needs to read the inverted index to get the list of documents and the hit lists in one read, merge the lists to get the candidate docs, process the hit lists, and combine IR with PageRank. Now that the web has grown, Google uses more than one read operation, but those reads are done on different PCs.
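
    To make the "precalculated, query-independent" point concrete, here is a minimal PageRank sketch over a tiny invented link graph; it runs entirely offline, before any query exists:

    ```python
    # Minimal PageRank by power iteration over a toy link graph.
    # The graph and damping factor are illustrative, not Google's actual data.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += damping * share
            rank = new
        return rank

    print(pagerank(links))  # computed once offline, reused for every query
    ```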

    But you are wrong. Google does not store per-query information, such as "do this for this query". It's infeasible to patch specific queries; it's much more productive to concentrate on the ranking algo. Plus, it would take a lot of space to store info about every query.
     
    nohaber, May 23, 2004 IP
  14. compar

    compar Peon

    Messages:
    2,705
    Likes Received:
    169
    Best Answers:
    0
    Trophy Points:
    0
    #34
    I've seen you mention this IR factor a number of times now and I have no idea what you are referring to. Can you explain this please?
     
    compar, May 23, 2004 IP
  15. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #35
    IR means Information Retrieval. It's simply matching documents to keywords. Basically, you look for instances of the search keywords in the text, and the more you find, and the closer they are to each other (for multi-word queries), the higher the IR score.

    As Google put it, the IR score guarantees specificity to the query and PageRank guarantees quality. The IR score is easy to bump up by just repeating the keywords, but PageRank is another story :) When you combine the two, you get the best search engine in the world.
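
    A toy illustration of that idea, counting occurrences and rewarding proximity; the formula and weights are invented, purely to show the shape of an IR-style score:

    ```python
    # Toy IR score: reward keyword occurrences and closeness of the keywords
    # to each other. The 10x proximity weighting is made up for illustration.
    def ir_score(text, keywords):
        words = text.lower().split()
        positions = {k: [i for i, w in enumerate(words) if w == k] for k in keywords}
        if any(not pos for pos in positions.values()):
            return 0.0  # a keyword is missing entirely
        count_score = sum(len(pos) for pos in positions.values())
        if len(keywords) > 1:
            # smallest gap between occurrences of the first and last keyword
            gap = min(abs(i - j) for i in positions[keywords[0]]
                                 for j in positions[keywords[-1]])
            proximity_bonus = 1.0 / (1 + gap)
        else:
            proximity_bonus = 0.0
        return count_score + 10 * proximity_bonus

    doc = "cheap flights to paris cheap hotel deals paris flights"
    print(ir_score(doc, ["cheap", "flights"]))  # repeating keywords inflates this
    ```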
     
    nohaber, May 23, 2004 IP
  16. Voyager

    Voyager Guest

    Messages:
    46
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #36
    "If Google has gone adaptive, they will weigh the different regulation and threshold variables depending on the competition for the keyword and how much movements there is in the serps for that specific keyword."

    Laiwa:

    My objection to this is that keywords are created dynamically.

    Webmasters think in terms of a finite set of keywords, as in "these are my three keywords". Users invent new key phrases constantly -- and it really is all about key phrases, not keywords.

    One user found my web page with the key phrase "phone number alberta live saskatchewan calling".

    I do not believe that the behavior of the Google algorithm varies depending upon the keywords put through it.

    However, one caveat: the behavior of the Google algorithm does demonstrably vary based upon the number of words in a phrase.
     
    Voyager, May 23, 2004 IP
  17. Foxy

    Foxy Chief Natural Foodie

    Messages:
    1,614
    Likes Received:
    48
    Best Answers:
    0
    Trophy Points:
    0
    #37
    I went back and got the quote below because I didn't say this

    "But you are wrong. Google does not store query information, such as do this for this query. It's unfeasible to patch specific queries. It's much more productive to concentrate on the ranking algo. Plus, it takes a lot of space to store info about every query."

    But it is relevant to what we are discussing

    You see, Hexed said that this hysteresis factor was applied to the keywords/keyword phrases, and it occurred to me that, since you were being very determined about the speed of the retrieval, there may have been a basic misunderstanding as to where this factor is applied.

    Hexed, as I understand it, meant it was applied by Google to the keywords/keyword phrases on your own pages at the "first level" in Google, the gestation period - that is, the actual determination by Google in its algorithms as to how the page might perform for the time when a searcher does a search on Google and initiates the retrieval sequence, which collects the PR, the IR and whatever else to produce the page of results.
    :)
     
    Foxy, May 23, 2004 IP
  18. Chatmaster

    Chatmaster Guest

    Messages:
    56
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #38
    This is one of the most useful posts I have ever read. Nohaber and Foxyweb, well done with this civilised discussion. Hexed, I believe your theory to be correct, or at least very possible. First I want to state that I am no match for any of you when it comes to programming experience, as all my programming experience is self-taught and only 7 years.

    But what makes sense to me is that when a page is spidered, a few things need to happen before the page is accepted into the Google SERPs. The page needs to be "cleared" in a sense: checked for spam, IP, duplicate content, etc. I believe this is where the sandbox effect takes place. Thereafter it has to go through a certain process that releases it from the sandbox, after which it is displayed together with the algo results. The actual algo would then mainly serve the user's query against the already-allowed sites, thus making it possible to run in under 0.2 secs.
     
    Chatmaster, May 24, 2004 IP
  19. nohaber

    nohaber Well-Known Member

    Messages:
    276
    Likes Received:
    18
    Best Answers:
    0
    Trophy Points:
    138
    #39
    chatmaster,
    I am telling you (and Google has said it many times): there's nothing specifically done to patch certain queries (as hexed proposes).
    There might be different reasons for the sandbox effect some sites experience. I like the idea of weighting a link by its age: when Google discovers a new inter-host link, it gives it, let's say, 25% weight and waits some time before giving it 100%. That would prevent SERP spikes when sites get listed in "what's new" pages, etc. Another thing to note: PageRank is recalculated periodically, and it's a time-consuming process. If PageRank is recalculated once every 3 months, that alone may slow down SERP changes.
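
    A toy sketch of that age-weighting idea (the 25% starting weight, the ramp-up length and the linear shape are all hypothetical, only to show the gist):

    ```python
    # Hypothetical link-age dampening: a brand-new link counts for 25% of its
    # value and ramps linearly to 100% over, say, 90 days. Numbers are invented.
    def link_weight(age_days, start=0.25, ramp_days=90):
        if age_days >= ramp_days:
            return 1.0
        return start + (1.0 - start) * (age_days / ramp_days)

    for age in (0, 30, 90, 365):
        print(age, round(link_weight(age), 2))  # 0.25, 0.5, 1.0, 1.0
    ```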

    Google might periodically reevaluate inter-host links for affiliation/relatedness. For example, when two hosts have very similar outgoing links, they look like mirrors or affiliated hosts. When you get links from affiliated hosts, Google might ignore all except the highest-ranked link (for example, the links from all the dmoz clones might be ignored).

    Another thing is co-citation: when two sites get links from the same set of other sites, those two sites are related.
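
    And a quick illustration of co-citation with made-up hosts, using simple overlap of inbound-link sets (the Bharat et al. paper uses more refined measures; this is only the gist):

    ```python
    # Co-citation gist: two sites whose inbound links come from largely the same
    # hosts are probably related. Hosts and link sets are invented examples.
    inlinks = {
        "siteA.com": {"blog1.com", "news2.com", "dir3.com", "forum4.com"},
        "siteB.com": {"blog1.com", "news2.com", "dir3.com"},
        "siteC.com": {"randomhost.com"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    print(jaccard(inlinks["siteA.com"], inlinks["siteB.com"]))  # 0.75 -> related
    print(jaccard(inlinks["siteA.com"], inlinks["siteC.com"]))  # 0.0  -> unrelated
    ```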

    There are lots of possible things. What I am saying is that Google does not do special things for specific queries. I think Google evolves in how it looks at inter-host links.

    Check out this paper by Krishna Bharat and two other Google employees:
    "Who Links to Whom: Mining Linkage between Web Sites"
    download from here: http://searchwell.com/krishna/publications.html
     
    nohaber, May 24, 2004 IP
  20. schlottke

    schlottke Peon

    Messages:
    2,185
    Likes Received:
    63
    Best Answers:
    0
    Trophy Points:
    0
    #40
    " am telling you, (and google has many times said it) - There's nothing specifically done to patch certain queries (as hexed proposes).
    There might be different reasons for the sandbox effect some sites experience. I like the idea of weighting a link by its age. Like when google discovers a new interhost link, it gives it let's say 25% weight and waits some time before giving 100%. It will prevent SERP spikes when sites get listed in "what's new pages" etc. Other things to note: PageRank is recalculated periodically. It's a time consuming process. If PageRank is recalculated once in 3 months, it may slow down SERP changes."

    You totally contradicted yourself. It is obvious Google doesn't actually hand-pick queries and alter them; that isn't what Hexed is saying. Hexed is saying that sites receive different "sandbox" lengths based on the age/optimization of the sites involved in the search. It would stand to reason that these factors (AGE AND OPTIMIZATION) would go hand in hand: the older sites are, the more backlinks they would naturally have (without buying them, obviously - thus the reason for the "sandbox effect", if it did exist).
     
    schlottke, May 24, 2004 IP