I would have to say there is an argument for proximity of words near keywords... "Check out" (this site) versus "check out now" (buy now) is something I have been looking at a bit, to see if Google's semantic algos are working properly (no conclusions yet). If I were building my own, and somebody writing about politics suddenly had a link to a widget site in the text, it would get flagged. (We've seen these.) Same for plain text: if they suddenly went off-topic and started talking about widgets, the page would score for widgets and much less for politics. A simple mention of widgets would not.
I agree, and the PageRank documentation does support this theory. Six or seven years ago Google was storing the first 4,096 bytes of plain text on the page and using the proximity of words to find the results that best matched the query.
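Just to make the idea concrete, here is a toy sketch of that kind of check. The 4,096-byte cap comes from the post above; the actual scoring function is entirely my own guess, not anything Google has published:

```python
def proximity_score(text, term_a, term_b, cap=4096):
    """Score how close two query terms appear within the first `cap` bytes.

    Returns 0.0 if either term is missing; otherwise a value that grows
    as the minimum word distance between the terms shrinks.
    """
    # Truncate to the byte cap, then split into lowercase words.
    words = text.encode("utf-8")[:cap].decode("utf-8", errors="ignore").lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return 0.0
    min_dist = min(abs(a - b) for a in pos_a for b in pos_b)
    return 1.0 / (1 + min_dist)
```

So "check out" scores higher than "check ... out" with words in between, which is the proximity effect being discussed.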
I remember reading that now, and experimenting with this: http://forums.digitalpoint.com/showpost.php?p=181163&postcount=5 It worked well. Nowadays you can do similar things with CSS.
The Google algorithm is such a closely guarded secret that trying to figure it out completely would be like trying to find a pin in a needle stack. I sure hope Google treats their employees like gold, otherwise some disgruntled employee could spill the beans on their algorithm.
I think this is a really good exercise even if we can't find all of the factors. It gets us thinking in creative ways that can only benefit our SEO efforts. I don't think Google's algo takes the meta tags into account, especially the meta keywords tag. It uses the meta description sometimes, I think, but I'm not sure if it plays into the ranking. I would think the DMOZ description would be very useful for Google to find out what a site's about, since it is reviewed by editors who look at the website to make sure the description is accurate, and it is pretty much guaranteed not to have keyword stuffing or anything else that manipulates search engines - unless the editor is asleep on the job.
you guys have come up with some good ones that I wasn't even thinking about. Whether or not all of these are ACTUAL variables, I think making the list of POSSIBLE variables is a good exercise. I would update my post to include the new stuff but I can't edit it now. Maybe one of the MODS can either make this thread a sticky or make a new thread a sticky with the compiled list that they can add to. I don't know how it works, I just think this information shouldn't be lost on page 5 in a couple of weeks. Just my thoughts. Keep them coming!!
It is easy to understand what Google is trying to put into its algo: judge a webpage as wisely as a professional human would, work out which "keyword" it is useful for, and note that down. It sounds like rocket science, but it is really a simple idea.
Thanks speakerwire and everyone else for some great posts. I have started to arrange the items in Word so I can copy and repost one great complete list - but of course it won't be complete because we are still missing hundreds of factors! I appreciate the great answers and will post the list soon. Please add any more you can think of in the meantime and I'll add those in too. I think this will produce a great list of things to bear in mind for any web developer.
OK, I am just putting the list together - can somebody clarify whether the following suggested items are possible...

Physical address listed - can worldwide addresses be identified? Would it look for a zip/postal code? Would they build a list of valid codes and, if one was found along with the word 'address' or the word 'contact' on a page, give the site a point? I refer to points as I assume Google has to build a score for your site based on the 600 (assumed) factors.

Spelling - can anybody think of a well-known site of any type which ranks highly in Google but whose spelling you could not normally find in a dictionary?

Page colours - I am no expert on hex colour values, but if they go from 000000 to FFFFFF, is each one progressively darker than the next? Would they look for conflicting or hard-to-read colours? Again, can anyone suggest a site that might disprove this? Thanks.
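On the hex question: the numeric ordering of hex codes is not a brightness ordering, because the six digits are really three separate red/green/blue byte pairs. A quick sketch of computing perceived brightness instead (the luma weights here are the standard BT.601 ones; nothing suggests Google uses these exact numbers):

```python
def brightness(hex_colour):
    """Perceived brightness (0-255) of a hex colour like 'FFEE00'.

    Raw hex order is NOT brightness order: '0000FF' (blue) is
    numerically larger than '00FF00' (green) but far darker to the eye.
    """
    r = int(hex_colour[0:2], 16)
    g = int(hex_colour[2:4], 16)
    b = int(hex_colour[4:6], 16)
    # Standard luma weights: the eye is most sensitive to green.
    return 0.299 * r + 0.587 * g + 0.114 * b
```

So 000000 and FFFFFF are indeed the darkest and lightest extremes, but the values in between do not get progressively darker as the hex number grows.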
As far as colours go... I doubt Google looks at whether they are too contrasting or conflicting, mainly because who is to say what people like? Some people like smooth, flowing colour schemes, and others like high contrast in their colours to make things stand out. The only thing I could see Google taking into account would be if the background colour is the same as some text colour, so as to hide the text from visitors, which Google doesn't like. And I have seen this done on websites - not so much anymore, but in the past, most definitely.
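That hidden-text check is easy to imagine in code. This is purely a guess at the kind of comparison a crawler might run (a real engine would also have to resolve CSS, background images, and so on), comparing the text colour to the background colour channel by channel:

```python
def looks_hidden(text_hex, bg_hex, tolerance=16):
    """Flag text whose hex colour is (nearly) identical to the background.

    `tolerance` is the maximum summed per-channel difference that still
    counts as "hidden" -- an arbitrary threshold for this sketch.
    """
    def channels(h):
        # Split 'RRGGBB' into three integer channels.
        return [int(h[i:i + 2], 16) for i in (0, 2, 4)]

    diff = sum(abs(a - b) for a, b in zip(channels(text_hex), channels(bg_hex)))
    return diff <= tolerance
```

White-on-white trips it, black-on-white does not, and near-white text on a white background still gets caught.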
I think physical address, phone, email address, etc. should not be, and probably are not, a factor. Many companies have these behind robots.txt files, or noindex'd, to avoid junk mail, spam, and unwarranted cold sales calls. Many lists are built and sold from what can be found by a simple spider, and it would make sense that scraping the SERPs from Google for such info would also yield profitable data. If that does not work, you could get some hack in a third-world country to sit there all day and copy and paste for cents an hour. Try posting your email address on a decent-ranking site and see how the spam goes up. This increases time spent dealing with email, even if you delete it or use spam filters. But some always gets through, slowing down the service.
This might not be a secret too much longer. If the government gets its way and Google chooses to battle this in court, several of Google's algo secrets will come out during cross-examination. Google even says in their answer to the government's motion that this is a concern of theirs. I can see it now: Yahoo, MSN and SEOs worldwide sitting in the courtroom day in and day out, waiting for something to come out.
In the past I've had a site rank higher with a physical address listed on the contact page. I can see why this would be a factor - a site with a legitimate postal address can be more trustworthy than one without. I recently developed a web crawler that takes physical addresses on a site very seriously. It uses geodetic transformations and effectively "maps" sites around the country, then breaks them down by topic. It's still in beta at the moment but it seems to bring back some great results. I feel like I spend more time with my head in research papers than I do coding...

Many SEO contests use non-dictionary phrases (I would quote some, but I can't remember how to spell them). I could imagine Google penalising a site with more than X% non-dictionary words (i.e. spelling mistakes), but I haven't seen proof.

Close, it's actually the other way round. Yes, hidden text is now hunted down like the cheap trick it is (white text on a white background), and I believe the new GoogleBot has been developed to spot similar tricks with CSS and/or hidden divs.

I'll try to think of more possible variables today and see what I can come up with. Very interesting thread
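The X% non-dictionary idea above is speculation, but the measurement itself is simple to sketch. This just computes the ratio a threshold could be applied to, against whatever word set you supply:

```python
def non_dictionary_ratio(text, dictionary):
    """Fraction of words in `text` not found in `dictionary` (a set of
    lowercase words). Punctuation is stripped before lookup."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)
```

A site could then be flagged when the ratio exceeds some cutoff - though as noted, there's no proof Google does anything like this.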
The first mistake I see here is the use of the word "algorithm". There are MANY different algorithmS and processes at work.
Interesting thread... perhaps there's a Google factor that flags "hot" discussions about Google factors for their attorneys to investigate further. It sure seems like they have people with spare time to keep up on this stuff. I'm curious about (d). Would this be analogous to a "bad neighborhood"? i.e. if your DNS provider also hosts many xxx-related domains, etc. Or ranking different DNS providers - does Verisign outrank GoDaddy? etc... Another comment I had about factors and weightings: I think it's been discussed before that different content/topic spaces could very well have different factors/thresholds, i.e., different weightings. So (simple example) even if you knew how H1 fared vs. 'alt' tags, that equation might be different for casino sites vs. religion sites. A few other things that certainly are factors include time-based and event-based relevancy. So "heart-shaped box" could mean one thing around Feb 14th, and a totally different thing around late March (American Heart Assoc. "Heart Walk" events). For news/relevancy items, Google has GoogleNews, a well-segmented source for classifying "current event" keywords and topics. LC
Robert Scoble from MSN started a search engine experiment with blogs using a word he made up called "brrreeeport", and his site is PR 6 and probably ranked highly for some keywords other than the obvious "scoble" and "brrreeeport". Were you wondering about spelling mistakes and how the search engines treat sites with them? If so, this might be a good one to watch. I doubt MSN penalizes for this, since Scoble is an MSN search guy and he's doing it all over his blog.
Do you mean he is misspelling words on purpose, or by accident? I would expect a dictionary of common misspellings to be used; I doubt that "brrreeeport" would be in it. http://itre.cis.upenn.edu/~myl/languagelog/archives/001533.html has many purposefully misspelled names, in a post blogging about a mural that had 11 misspellings. Which gives me another idea... using the wrong, correctly spelled, but similar-sounding/spelled word?
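A "dictionary of common misspellings" lookup could also be approximated with fuzzy matching. Here's a sketch using Python's difflib; the word list and the 0.8 cutoff are my own assumptions, and note that a made-up word like "brrreeeport" matches nothing, so it passes through unflagged - which is the point of the experiment:

```python
import difflib

# Hypothetical vocabulary; a real checker would load a full dictionary.
COMMON_WORDS = {"receive", "separate", "definitely", "government"}

def likely_misspelling(word, vocabulary=COMMON_WORDS):
    """Return the closest dictionary word if `word` looks like a
    misspelling of it, else None."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None
```

The "wrong but correctly spelled, similar-sounding word" trick mentioned above would defeat this kind of check entirely, since the wrong word is itself in the dictionary.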
Essentially this is the most comprehensive SEO thread I've ever read. Will be keeping my mince pies on this one!
A few more I did not see listed above:
Validated HTML.
Following accessibility guidelines for those with disabilities, per w3.org (no spam).
NO broken links.
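On the broken-links one: the first step any checker needs is pulling the hrefs out of a page, so each can then be tested (e.g. with an HTTP HEAD request) for a non-200 status. A minimal sketch with Python's standard html.parser, with the network part left as a comment:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href from <a> tags for later status checking."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    # Each collected link would then be fetched (urllib, HEAD request)
    # and flagged if the response status is 4xx/5xx.
    return collector.links
```

Whether engines actually penalise broken links, or just fail to follow them, is another open question for the list.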