Crawler-based Search Engine (part 1)

arale Guest

#1

In the previous article (Classification of Search Engines) we discussed how the crawler-based engines work. Typically, special crawler software visits your site and reads the source code of your pages. This process is called "crawling" or "spidering". Then, your page is compressed and put into the search engine's repository called an "index". This stage is called "indexing". Finally, when someone submits a query to the search engine, it pulls your page out of the index and gives it a certain rank among the other results it has found for this query. This is called "ranking".

For indexing, crawler-based engines usually consider many more factors than those they can find on your pages. Before putting your page into an index, a crawler will look at how many other pages in the index link to yours, the text used in the links that point to you, the Page Rank of the linking pages, whether the page is listed in directories under related categories, and so on. These "off-the-page" factors carry considerable weight when a crawler-based engine evaluates a page. While you can, in theory, artificially increase your page's relevance for certain keywords by adjusting the corresponding areas of your HTML code, you have much less control over the other pages on the Internet that link to you. Thus, off-the-page relevance prevails in the crawler's eyes.

In this lesson, we look at the main spider-based search engines, and learn how we can get each of them to index our site and rank it high. Although this step does not closely deal with the optimization process itself, we provide information on how each search engine looks at your pages, so you can come back to this section for reference later.

Google (http://www.google.com/)

You can submit your site to Google here: http://www.google.com/addurl/ and your site will probably be indexed in around 1-2 months.

Please keep in mind that Google may ignore your submission request for a long time. Even if it happens to crawl your site, it may not actually index it if there are no links pointing to it. However, if Google finds your site by following the links from other pages that have already been indexed and are regularly re-spidered, chances are that you will be included without any submission. These chances are much higher if Google finds your site by reading a directory listing.

So, you can submit your site and it may help, but links are the best way to get indexed.

In the past, Google performed monthly updates that experts called the "Google Dance". At the beginning of the month a deep crawl of the web took place; then, over a couple of weeks, the Page Rank for the retrieved pages was calculated; and at the end of the month the index database was finally updated. These days, Google maintains a database which is updated continuously. The "Dances" still take place from time to time, but only when Google needs to make major changes to its algorithm. For example, the Dance in November 2003 (known as the Google Florida Update) was actually the first for about six months. In January 2004, Google started another dance (the Austin Update), in which pages that had disappeared during "Florida" showed up again, and many pages that hadn't disappeared the first time were now gone.

In February 2004 Google updated once more and things settled down. Most people's lost pages came back, and although the results were rather different from those shown before Florida, at least pages didn't seem to be gone for no reason.

As of this writing, Google claims to have a little more than 8,000,000,000 pages in its index. The engine constantly adds new pages to the index database: it usually takes around two days to list a new page after the Googlebot (Google's spider) has crawled it. Results on Google tend to shift on a weekly basis, probably because Google is running mini updates. That may be one of the reasons for the differences you see when you check your ranking with an automated tool and then look at the results in your browser.

Google has lots of so-called "regional" branches, such as "Google Australia", "Google Canada", etc. These branches are modifications of its index database stored on servers located in the corresponding regions. They are meant to further adjust search results to searchers' needs: when you search, Google detects your IP address (and thus your approximate location) and feeds you the results from the most appropriate index database.

During the "Google Dances", the results on different regional Googles may differ noticeably.

Submission to the "Main Google" will list your site in all its regional branches - after Google indexes you, of course.

Google has a number of crawlers to do the spidering. They all have the name "Googlebot", but they come from a number of different IP addresses. You can see whether Google has visited your site by looking through your server logs: just find an IP address matching 64.68.xx.xx (alternatively, a domain address such as crawl2.googlebot.com or crawl3.googlebot.com) and most probably you will see the user-agent defined as Googlebot.
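The log check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes your server writes the Apache "combined" log format, and the function name find_googlebot_hits is made up for this example. It flags a line when the IP starts with 64.68. or when the user-agent contains "googlebot".

```python
import re

# Assumption: Apache "combined" log format. Adjust the pattern for other servers.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_googlebot_hits(lines):
    """Return (ip, time, request) tuples for lines that look like Googlebot visits.

    Hypothetical helper: a line counts as a hit if the IP starts with the
    64.68. prefix mentioned in the article or the user-agent names Googlebot.
    """
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip = match.group("ip")
        agent = match.group("agent")
        if ip.startswith("64.68.") or "googlebot" in agent.lower():
            hits.append((ip, match.group("time"), match.group("request")))
    return hits
```

To use it, pass any iterable of log lines, for example an open file: find_googlebot_hits(open("access.log")), where access.log stands for whatever your server's log file is called. Keep in mind that a user-agent string can be faked, so a match on the user-agent alone is only a hint.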

Google is by far the most important search engine. Apart from their own site receiving 350 million searches per day, they also provide the search results for AOL Search, ICQ Search, and Netscape Search (amongst others). For this reason, most optimizers first focus on Google. Generally, this makes sense.

How to optimize for Google

Most important for Google are three factors: Page Rank, link anchor text and semantics.

Page Rank is an absolute value which Google regularly calculates for each page in its index. We will give a detailed description later in this course; for now, it is enough to know that the number of links you get from sites outside your domain matters greatly, as does the quality of those links. Quality means that, in order to give you some weight, the sites linking to yours must themselves have a high Page Rank, be content-rich and be regularly updated.
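To make the idea of "links from high-ranking pages give more weight" concrete, here is a minimal sketch of the simplified, publicly described Page Rank iteration. It is not Google's production algorithm: the damping value of 0.85, the iteration count and the function name pagerank are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified Page Rank by repeated redistribution of rank along links.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; the scores sum to about 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

With links = {"a": ["c"], "b": ["c"], "c": ["a"]}, page c ends up with the highest score because two pages link to it, and page a outranks page b because its only inbound link comes from the strong page c, while b has no inbound links at all.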

MiniRank/Local Rank is a modification of the Page Rank based only on the link structure within your own site. Since search engines rank pages, not sites, certain pages of your site will rank higher for given keywords than others. Local Rank has a significant influence on the general Page Rank.

Anchor text is the text of the links that point to your pages. For instance, if someone links to you with the words "see this great web site", this is a useless link. However, if you sell car tires and a link from another site to yours says "car tires from leading brands", that link will boost your rank when someone searches for car tires on Google.
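If you want to see what anchor text a linking page actually uses, a small script can pull it out of the HTML. The sketch below uses only Python's standard html.parser module; the class name AnchorTextCollector and the function anchor_texts are made up for this example, and fetching the linking page is left to you.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def anchor_texts(html):
    """Return a list of (href, anchor text) pairs found in the given HTML."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Running anchor_texts over the source of a page that links to you shows at a glance whether the link says something descriptive like "car tires from leading brands" or something generic like "see this great web site".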

Semantics is the new factor that appears to have made the biggest difference to the results. This term refers to the meaning of words and their relationships. Google bought a company called Applied Semantics back in 2003 and has been using the technology for their AdSense contextual advertising program. According to the principles of applied semantics, the crawler attempts to define which words mean the same thing and which ones are always used together.

For example, if there are a certain number of pages in Google's index saying that an executive desk is a piece of office furniture, Google associates the two phrases. After this, a page about executive desks using the keywords "office furniture" won't show up in a search for the keywords "executive desk". On the other hand, a page that mentions "executive desk" will rank better if it also mentions "office furniture".

Now, there are two other terms related to the way Google ranks pages: Hilltop and Sandbox.

Hilltop is an algorithm that was created in 1999. Basically, it looks at the relationship between the "Expert" and "Authority" pages. An "Expert" is a page that links to lots of other relevant documents. An "Authority" is a page that has links pointing to it from the "Expert" pages.

In theory, Google would find "Expert" pages and then the pages that they link to would rank well. Pages on sites like Yahoo, DMOZ, college sites and library sites can be considered experts.
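The Expert/Authority relationship can be illustrated with a toy link graph. The sketch below is a rough simplification and not the actual Hilltop algorithm: it ignores topical relevance entirely, and the thresholds min_outlinks and min_expert_votes, as well as the function name, are assumptions picked for the example.

```python
def find_experts_and_authorities(links, min_outlinks=3, min_expert_votes=2):
    """Label pages in a small link graph as "Experts" or "Authorities".

    links: dict mapping each page to the list of pages it links to.
    An Expert here is a page linking to at least min_outlinks other pages.
    An Authority is a page linked from at least min_expert_votes Experts.
    Both thresholds are illustrative assumptions.
    """
    # Experts: pages with enough distinct outgoing links (self-links ignored).
    experts = {
        page
        for page, targets in links.items()
        if len(set(targets) - {page}) >= min_outlinks
    }

    # Count how many Experts point at each page.
    votes = {}
    for expert in experts:
        for target in set(links[expert]) - {expert}:
            votes[target] = votes.get(target, 0) + 1

    authorities = {page for page, count in votes.items() if count >= min_expert_votes}
    return experts, authorities
```

For instance, if two directory pages each link to several sites and both include your page, your page comes out as an Authority, while a page listed in only one directory, or linked only from an ordinary blog, does not.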

Sandbox refers to an algorithm which detects how old your page is and how long ago it was updated. Pages with stale content usually tend to slip gradually down the result list, while newly crawled pages initially get higher positions than they would if ranked on Page Rank alone. In other words, Google assumes that new pages have more relevant and up-to-date content and gives them a certain advantage over the stale pages. That is, constantly updating your pages can help keep them high in the list.

More parts will be released soon on www.webuniver.com

arale, Aug 20, 2007

mydomainoffer Guest

#2

Good information, but very long; I would consider adding a summary at the top and then elaborating...

mydomainoffer, Aug 20, 2007

Kuldeep1952 Active Member

#3

Concise information, but my views are:

- The time taken by Google is not 1-2 months, but much less.

- You could also probably mention the -30 penalty, -950 penalty, and the Supplemental Index.

Kuldeep1952, Aug 20, 2007

hardarcade Peon

#4

Good article (took me a while to read). I'm starting to understand Google a little better now. Thx

hardarcade, Aug 20, 2007
