Hi all, I'm very, very interested in knowing how Google works. Yes, I know about PageRank and stuff, but I thought I'd tap into the collective wisdom of all DP forummers to learn more about Google. Here are some key concepts that are associated with Google:

PageRank
A citation-based ranking system. A PR is actually an assigned code for the probability that a random surfer will surf to your site. The random surfer is assumed to follow an i.i.d. distribution. It is based on the number of links pointing to your site and the number of citations your site makes. It makes use of Markov chains.

MapReduce and BigTable + GFS + Berkeley DB HA
They're methods for parallel computing and database storage. A normal RDBMS doesn't work for storing huge amounts of data. Cheapskate people (and Yahoo!) will use Hadoop.

MinHash
Google's method of clustering and sorting data. Visibly used in Google News, but I suspect it's also used in clustering sites to assign PageRank.

What else do we know? TrustRank is, AFAIK, untrue. So is the sandbox. The so-called 'sandbox' is, I believe, the result of some MinHash variant plus some form of spam-fighting tool (probably a Markov chain).

I'm very, very curious to know more about Google's matching system, and also about its current generation of search engine, TeraGoogle - how much of it has been implemented. Some people say PR is just for decoration, but I beg to differ. I think PR is vital in SERP scoring. I want to know what other things are missing from the equation.

Also, I am interested in how Google is fighting spam. What methods are they using? Hidden Markov models? Bayes? KNN? Neural nets?

Now, let's crack the Google code.
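To make two of the ideas above concrete, here is a minimal Python sketch of the random-surfer model behind PageRank and of MinHash similarity estimation. It is only an illustration under stated assumptions, not Google's actual implementation: the function names (`pagerank`, `minhash_signature`, `estimated_jaccard`), the damping factor of 0.85, the toy link graph, and the use of BLAKE2 as the hash family are all choices made for this example.

```python
import hashlib


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on the random-surfer Markov chain.

    `links` maps each page to the list of pages it links to.
    The damping value and iteration count are assumptions.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Otherwise the surfer follows one outgoing link at random.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links sends the surfer anywhere.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


def minhash_signature(tokens, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical four-page web: "d" has no inbound links.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)

doc1 = {"google", "pagerank", "markov", "chain"}
doc2 = {"google", "pagerank", "minhash", "cluster"}
similarity = estimated_jaccard(minhash_signature(doc1), minhash_signature(doc2))
```

In the toy graph, the ranks sum to 1 and can be read as the long-run probability that the surfer is on each page. The MinHash estimate approaches the true Jaccard similarity of the two token sets as `num_hashes` grows, which is what makes it usable for clustering near-duplicate documents.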
Should we talk about the fact that Google is not a 100% ethical company, or would you rather we leave that part out?
How about trying to read around the forum? I love posts like this because it basically says to me "guys, I don't wanna read like all y'all did, so give me the Cliff Notes". There is a plethora of information on this forum. Read.
I've gone through most of them. A lot of forummers talk about stuff like TrustRank (yes, I know Google patented it), and say that PageRank is just decoration. But I've stated my doubts about stuff like TrustRank (which is also a Markov chain method to tell if sites are 'good' or 'bad') and the sandbox (which I personally think is the result of an amalgamated function that manages the rest of the PR and SERPs). I'm interested in the mechanics behind Google. I'd be grateful if someone could shed some light.
Google is an advertising company. LOL, and no company in this world is 100% ethical. It tends toward informational sites because their major revenue comes from relevant information search.
What is this statement based on? Now you will say AdWords and AdSense, but those advertisements are specific to the related search or content. Google is a search engine and provides related ads, not other ads that we don't want to see.