Surely not - that would mean that if a duplicate document was found, it would be marked never to be crawled again, and the author could never amend it, make it non-duplicate, and then have it re-indexed.
Well then I guess we disagree on bots. IMO bots are not intelligent programs; they are optimized for one task and one task only: go to a page, query it to see if it has been updated since the last spidering, if so read the page and send it to the repository, then go to the next link and repeat the process. Adding a dup content filter to the bots would seriously slow the crawling process, and at any rate you can do the same thing much more efficiently by running the program over the database, where you have much more processing power and better bandwidth.

To suggest that once a page is detected as a duplicate it would never be crawled again is contrary to what search engines are set up to do. Did it ever occur to you that pages change?

One of the features of LocalRank is that it removes all but one page from any one Class C IP address and only ranks the remaining pages. Obviously all internal links and pages are from the same IP address. Agreed, but you are the one who says LocalRank is being used, not me, and if it is being used, then by your logic it must be making a substantial difference to the rankings, but I do not see any such result.

Of course it's not, who ever said it was??? And you are taking this to mean that the duplicate content filter is based solely on the page title and the SERP snippet?? We must be reading totally different patents.

Not that it makes any material difference to the discussion, but a master's degree in engineering and twenty years' experience in the computer field. I am not a programmer, I hire programmers. Insofar as algo background, six years of intensive study of just about everything I can find on the subject. Now my question: why are you so programmer-centric? Are you not aware that there are other skills in the world?
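To show what I mean by a bot doing "one task and one task only", here is a bare-bones Python sketch of that check-then-fetch step. It is purely illustrative; send_to_repository() and last_crawl_time are stand-ins I made up, not anything from Google.

```python
# A minimal sketch of the "check first, fetch only if changed" bot step
# described above, using a conditional HTTP request. Purely illustrative;
# send_to_repository() and last_crawl_time are hypothetical stand-ins.

import urllib.request
import urllib.error
from email.utils import formatdate

def crawl_if_updated(url, last_crawl_time, send_to_repository):
    """Fetch the page only if the server says it changed since the last spidering."""
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since", formatdate(last_crawl_time, usegmt=True))
    try:
        with urllib.request.urlopen(request) as response:
            send_to_repository(url, response.read())   # page changed: store it
    except urllib.error.HTTPError as err:
        if err.code != 304:                             # 304 = Not Modified, skip it
            raise
```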
Mel, I don't know where to start. You mix up the terminology a bit. Crawling is a function of search engines. It is not simply fetching pages, and Google's crawling function is VERY intelligent. Google needs to give a priority to each page, so that low PR pages are crawled once a month, while high PR pages may get crawled a couple of times a day. Read "Efficient Crawling Through URL Ordering". There are other papers on this subject and it is quite important to search engines. If a search engine crawls the wrong pages, it will lead to very slow updates. You can't call crawling unintelligent. By bots you probably mean the URLServers. They are just a small part of crawling.

After a page is fetched, fingertips (as Google calls them) are generated and compared to the fingertip database. If a document is found to be a duplicate of another one, its affiliate or groupID or whatever you call it is set the same as the other document. This is done by a union-find operation. If two docs have the same groupID, then they are duplicates. At some point later, you may separately run a duplicate site detection application by checking the groupIDs and the linking structure.

I didn't say it would never be crawled again. It is not ordinary to start a site with duplicate pages and only afterwards develop/change the content. That's plain stupid. Google has the full right to ignore such pages and use its resources for unique content. If a webmaster makes this kind of stupidity, he will simply need to change the page content and name, and link to it from a page that's not labeled duplicate. Of course dup-doc detection is part of the crawling facility. As I said earlier, read the damn papers, look at the images etc.

Internal links don't count in LocalRank, but they DO count in the OldScore. LocalRank is meant to increase the relevancy of pages that have non-affiliated links from other highly ranked pages. How do you know that? As I said, read the goddamn patents. There are two types of dup content: 1. Duplicate and near-dup documents, where two WHOLE docs are dups. 2. QUERY-SPECIFIC dup documents. Two docs may be quite different, but if they have three paragraphs in common, and the query keywords are found there, Google may decide that they are query-specific dups. There's a separate patent on query-specific dup detection. It's part of the ranking function.

I am not programmer-centric. Programming is just one of the things you need to become an SEO expert. You need substantial experience solving complex algorithmic problems. I know it. I have participated in loads of programming contests and I have seen great programmers solving nothing, because they don't know algorithms. Programming is a skill to code. Algorithms are another animal. If you work at Google, you have to have it all. If you invent some great way to increase the relevancy and it turns out that it needs two minutes to execute a query, you have simply wasted time. I have had my ass whooped more than once in competition and I have the utmost respect for guys who have the background to do it. In competitions, you are given 5 to 7 tasks, a time limit of let's say 1 second, and every problem has just one efficient way to be solved. If you don't know that algorithm, you may be the best programmer in the world, but you can't make it run under 1 second. You get 0 points. The best high-school and university programmers and algorithmically trained people participate in competitions, and a lot of the time more than 50% of all teams solve 0 problems.
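For anyone curious what I mean by union-find grouping, here is a rough Python sketch in my own naming, not anything from Google's papers: documents that share a fingerprint get merged into the same group, and two docs with the same group representative are treated as duplicates.

```python
# Minimal union-find sketch for grouping duplicate documents.
# All names and structures here are illustrative assumptions,
# not Google's actual implementation.

class DupGroups:
    def __init__(self):
        self.parent = {}          # docID -> parent docID
        self.fingerprints = {}    # fingerprint -> first docID that produced it

    def find(self, doc_id):
        """Find the group representative (with path compression)."""
        self.parent.setdefault(doc_id, doc_id)
        if self.parent[doc_id] != doc_id:
            self.parent[doc_id] = self.find(self.parent[doc_id])
        return self.parent[doc_id]

    def union(self, a, b):
        """Merge the groups of two documents."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def add_document(self, doc_id, fingerprint):
        """If the fingerprint was seen before, join that doc's group."""
        if fingerprint in self.fingerprints:
            self.union(self.fingerprints[fingerprint], doc_id)
        else:
            self.fingerprints[fingerprint] = doc_id

    def same_group(self, a, b):
        """Two docs with the same groupID are considered duplicates."""
        return self.find(a) == self.find(b)
```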
I don't understand the proof of maybe 50% of all the more complicated algorithms that I've used at competitions, but I can code them correctly. Now, if some idiot comes to me and tells me he is not a programmer but knows how to solve such tasks, I tell him he is full of it.

Neither you nor I have the needed background to speak authoritatively on what Google does. I have a lot of algo/programming experience, but I lack the specific information SE engineers have. They have read, tried, experimented, coded etc. for years. I have only read the public literature. If I make a great software program, and some self-proclaimed expert who has never programmed, let alone knows anything about algos, starts yapping around explaining to other people how my program works, I'll either be ROFLing or appalled. I can only imagine how Google guys and gals laugh at the pathetic efforts of SEO experts to explain Google. That's why I am very careful about what I say.

I asked you about your background because a programmer would never use your language and terms to try to explain certain things in Google. It is just obvious. I am in no way trying to insult anyone. I am just saying that we shouldn't talk too much about things we don't know and attach "expert", "guru" or whatever nonsense to ourselves. We are all Search Engine Amateurs and we are chatting here about search engines.
Yes, I've read the papers, but we seem to read them differently. When I say bots I mean bots, and when I say URL servers I mean URL servers. While it is true that the URL servers have mechanisms for prioritizing the crawling order, this does not make the bots intelligent; they are a separate function/program. They are called fingerprints, not fingertips, by the way. May I suggest that you reread them?

Here is a quote from Google's Patent number 6,659,423 on duplicate page detection (Specifications Page 21): The document first lists the five primary methods in which the invention may be used and then states: From this it is apparent to me that this is not a system primarily proposed for use during crawling, although that is a possible use of it, and it certainly presents no proof that the patent is used by bots at the time they spider the pages, as you seem to imply.

Yes, I think that's what I said, but I do not think it increases the relevancy of the pages, but perhaps the results. If you set the variables in LocalRank you can have either OldScore or LocalRank as more or less important, and as you say there would be no logical reason to implement LocalRank if it were not going to be an important part of the ranking system; therefore I ask again, what is your evidence that LocalRank is in fact in use?

Fine, you have your ideas of what an expert is and I have mine, but I can assure you that simply being a programmer will not make you a successful SEO, and that is a subject in which I believe I have enough experience to discuss with authority. No one said we were search engine experts, or search engine amateurs; we are not trying to build a search engine, we are discussing SEO here, but just having programming ability does not make anyone a search engine expert. There are hundreds of search engines coded every year; some never see the light of day, some do but die after a few months, and some may succeed, but IMO those that do succeed will have that rare combination of a vision of what a search engine should do, mathematical skills, programming skills, and the business acumen to make it viable.

At any rate this discussion has yielded nothing new or any evidence that LocalRank is in fact in use, etc. etc., and thus there are better uses for large blocks of time such as this.
Let me cite the original paper: the 3 major applications are crawling, indexing and searching. Where do you think the duplicate-doc detection is? Even if the detection information is used later, the ACTUAL detection is part of the crawling application. And the crawling application IS intelligent. The query-specific doc detection is in the searching application. Here's a link to the second patent: Detecting query-specific duplicate documents.

I've never said that. I just said that programming/algos is a prerequisite to being an SEO expert, and lacking either of the two automatically means you can't understand SEs.

I have evidence. I have monitored the "diet software" and "fitness software" keyphrases for a long enough period of time, and LocalRank IMO is used. Example: the only site that does not have a DMOZ listing is mine, and although I have much better SEO than my competitors, a lot of them outrank me because they have a link from DMOZ from a page that already ranks high for fitness software. Example 2: if LocalRank is not used, then the top results will be the so-called authority sites or DMOZ and similar directories. But because these auth. sites link to the actual competitors, they push them up with LocalRank, so the user doesn't get a bunch of directory listings, but the actual sites. That's IMO one of the purposes of LocalRank. Example 3: the #1 ranked site for Search Engine Optimization has links from many other highly ranked pages for the term. If LocalRank is used, they'll get the highest LocalRank score.
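To make that concrete, here is a rough Python sketch of the idea as I read it from the LocalRank patent: re-rank the top results by counting non-affiliated inbound links from within the same result set. The data structures, the affiliation test and the 50/50 weighting are my own simplifications, not anything confirmed about Google's implementation.

```python
# Toy LocalRank-style re-ranking sketch. Assumptions: we already have
# an initial ranking (OldScore), a link graph among the top-N results,
# and a test for whether two docs are "affiliated" (same owner/IP block).
# None of this reflects Google's real code; it's just the patent's idea.

def local_rank(top_results, links, affiliated):
    """
    top_results: list of (doc, old_score), already sorted by old_score.
    links:       set of (source_doc, target_doc) pairs among these results.
    affiliated:  function(doc_a, doc_b) -> True if the two docs are affiliated.
    Returns the results re-sorted by a combined score.
    """
    local_score = {doc: 0 for doc, _ in top_results}

    # Count links from non-affiliated pages within the top results only.
    for src, dst in links:
        if src in local_score and dst in local_score:
            if src != dst and not affiliated(src, dst):
                local_score[dst] += 1

    # Combine OldScore and LocalScore; the weights are arbitrary here,
    # the patent leaves them as tunable parameters.
    combined = [(doc, 0.5 * old_score + 0.5 * local_score[doc])
                for doc, old_score in top_results]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)
```

The point of example 2 above is visible in the sketch: directory pages give out LocalScore to the sites they link to, so the actual sites rise above the directories themselves.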
I see a statement but nothing new to back it up. While they choose to discuss the 3 major applications, there could be, and are in fact, hundreds of other applications in use. And what does that have to do with crawling or our discussion? Well, you are entitled to your opinion and I to mine. I am not a programmer and I seem to be able to discuss this topic with you, but to be brutally frank the word that comes to mind is bullshit. If this is the best evidence you have then it's back to the drawing board as far as I'm concerned. I do not agree with either of your conclusions, nor do I see anything that could not be caused by nine or ten other things.
LOL. Let's call them thousands. Maybe we should reinvent the SE field. I guess we were also discussing dup-doc detection, the two types of dup-docs etc. Never mind. The word here is "seem". It's OK with me that you call me a bullshitter. You have no idea what I have observed for these queries, but you made a conclusion based on just one sentence. That's OK with me. Have a nice day.
Just a simple question: suppose I have a page of 1000 words and I copy a few sentences onto a different page, would that be seen as duplicate content? Or suppose I have a PDF document of a few hundred pages and copy a few pages onto another page, would that be duplicate content to the spiders?
I'm sure that a few sentences from a larger article would not be considered bad duplicate content; otherwise, how could any web page offer "quotes" from other articles on the web?
In this situation you have different documents. They aren't duplicate. If someone is searching for keywords that are used in the few duplicate sentences, Google may filter one of the pages from the SERPs for the given query.
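A crude way to picture that query-specific case, purely my guess at the mechanism rather than anything from the patent text: take the snippet region that matches the query from each candidate result and compare them; if two regions are near-identical, keep only one. A Python sketch:

```python
# Toy sketch of query-specific duplicate filtering. Assumption: for each
# result we can extract the text window around the query terms (the part
# a snippet would be built from), and we drop results whose windows are
# near-identical to one we've already kept. Hypothetical names throughout.

import re

def query_window(text, query_terms, radius=150):
    """Return the chunk of text around the first query-term hit."""
    lowered = text.lower()
    for term in query_terms:
        pos = lowered.find(term.lower())
        if pos != -1:
            return lowered[max(0, pos - radius): pos + radius]
    return lowered[:2 * radius]

def near_identical(a, b, threshold=0.9):
    """Compare two windows by word overlap (Jaccard similarity)."""
    wa, wb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= threshold

def filter_query_dups(results, query_terms):
    """results: list of (url, page_text). Keep one result per duplicate window."""
    kept, windows = [], []
    for url, text in results:
        window = query_window(text, query_terms)
        if any(near_identical(window, seen) for seen in windows):
            continue  # filtered from the SERP for this query only
        kept.append(url)
        windows.append(window)
    return kept
```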
I won't go down the ethics path. You own the site/s; you do whatever the hell you want to with them. End of story. If the search engines make money indirectly off the content they cache, good for them. As for penalties, well, that really makes the point, doesn't it? If you're good at what you do, you won't be penalized. At the end of the day, and as G well knows, it comes down to scale of ownership. Whether it's 100 webmasters all looking to SEO their individual sites, or one webmaster looking to SEO 100 sites, it's still a network. The approach just changes, that's all. If I'm giving all parties what they want (the bots, the audience, the stakeholders, and my banker), then I'm happy. Now all I gotta do is figure out this network management thingy.
And the search engines have the right to index you or drop you depending on if they like what you do. I guess we define networks differently, and BTW Google does care if you use a network of sites to get your rankings and has removed many sites from the rankings for that. Lots of luck.
Errr, sorry, but I have to say something to this. Having a simple hash in a database field for a URL with the page's PR/last time of crawl is simple, and checking against that each crawl is even simpler as a way to determine what pages to crawl and what pages not to crawl. This is not how Google does it at all; however, to say that the spider needs to be intelligent is a joke. It does not; its coders may be, but the damn spider is still just a program.

As for the duplicate filter, we were the first to report its existence with documentation, and I have not found anyone else that has tracked its usage at all, except to say, "Oh my, I was hit by it." The dupe content filter is exactly that: a filter which is run at periodic times across the index. Nothing more, and certainly not a core agent. If you read the method of fingerprinting dupe pages a little harder you would be able to see the patent is not made for an ongoing process.
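Something like this is what I mean by "simple": a bare-bones Python sketch of the lookup-then-decide step. The table schema, field names and PR-based revisit intervals are all made up for illustration, not anything from Google.

```python
# Bare-bones "should we crawl this URL?" check of the kind described above.
# The schema and the PR-based revisit intervals are made-up examples.
# Example setup: db = sqlite3.connect("crawl.db")

import sqlite3
import time

REVISIT_SECONDS = {          # higher PR -> crawl more often (arbitrary numbers)
    0: 30 * 86400, 1: 30 * 86400, 2: 14 * 86400, 3: 7 * 86400, 4: 3 * 86400,
    5: 86400, 6: 86400, 7: 43200, 8: 43200, 9: 21600, 10: 21600,
}

def should_crawl(db, url):
    """Look up the URL's stored PR and last crawl time and decide."""
    row = db.execute(
        "SELECT pagerank, last_crawl FROM urls WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return True                       # never seen before: crawl it
    pagerank, last_crawl = row
    interval = REVISIT_SECONDS.get(int(pagerank), 30 * 86400)
    return time.time() - last_crawl >= interval

def mark_crawled(db, url, content_hash):
    """Record the crawl time and a hash of the fetched content."""
    db.execute(
        "UPDATE urls SET last_crawl = ?, content_hash = ? WHERE url = ?",
        (time.time(), content_hash, url),
    )
    db.commit()
```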
If a software program, let's call it "Product X", implements an algorithm that does something intelligently, you can say that "Product X" is intelligent. I wouldn't call URL ordering in crawling a simple application. If it were simple, no one would be publishing papers on it. The more the web grows, the more important crawling becomes. Your description lacks at least a couple of factors: what if a server is down? What about a page's age, the number of slashes in the URL, new pages vs. old pages, etc.? You'll also have to make the ordering such that the crawling doesn't get stuck crawling only high PR pages and ignoring new/low PR ones. There are many details. You can't say it is simple.

First of what/who? How do you track the usage of a dup filter?

You are right. The way the exemplary implementations are described, it would suggest that it is periodically run. It can also be made a constantly ongoing process. It all depends on how Google implemented it. There are pros and cons to both approaches. Let's take a look at them:

1) Periodically run application
The pros: code separation and easier programming; less storage space (removing unique fingerprints).
The cons: if you remove fingerprints, you'd have to generate them again on the next pass, which requires more CPU power. What about newly found pages? They'd have to wait until the next dup-doc pass, so Google would either have to risk showing dup-docs or just wait until the next pass. That would slow down the time a page needs to get into the index: it would have to wait for the next dup-doc pass, then get parsed, indexed, sorted and put in the inverted index. That is not a bad feature if you want to delay new pages from getting ranked quickly. Also, you can pair the dup-doc pass with PR recalculations and maybe other stuff before major updates. A periodically run application on the whole repository will probably take days, and it would be done once every couple of weeks.

2) Constantly ongoing process
The pros: dup documents would be detected on the fly and could get indexed more quickly.
The cons: modification of currently used and well-tested code; more data storage.

Which of the two does Google use? I don't know. Maybe you are right. Maybe it is a periodically run application, although I wouldn't bet 100% on it. It is not the patent, it is Google that decides in the end. If it is periodically run, we can expect slower and slower ranking times for new pages as the web grows.
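For clarity, here is roughly what I have in mind for option 1, sketched in Python. fetch of the repository, the shingling fingerprints and the groups object are all hypothetical stand-ins; the only point is to show where the CPU cost of regenerating fingerprints on every pass comes from.

```python
# Sketch of a periodically run dup-doc pass over the whole repository.
# The repository iterable and the union-find "groups" object are
# hypothetical stand-ins, not Google's implementation.

def shingle_fingerprints(text, k=8):
    """Hash every k-word window of the document (a common dup-detection trick)."""
    words = text.split()
    return {hash(" ".join(words[i:i + k])) for i in range(max(1, len(words) - k + 1))}

def periodic_dup_pass(repository, groups, overlap_threshold=0.9):
    """
    repository: iterable of (doc_id, text) covering the whole index.
    groups:     a union-find structure with a union(doc_a, doc_b) operation.
    """
    seen = {}  # fingerprint -> first doc_id that produced it
    for doc_id, text in repository:
        prints = shingle_fingerprints(text)           # regenerated every pass (CPU cost)
        hits = [seen[fp] for fp in prints if fp in seen]
        if prints and len(hits) / len(prints) >= overlap_threshold:
            groups.union(hits[0], doc_id)             # same groupID => duplicates
        for fp in prints:
            seen.setdefault(fp, doc_id)
```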
People have published papers on the proper use of <b> tags. I have written more than my share of spidering applications; I can definitely say it is not difficult. I think in your argument you forget that Gbot is not actually only one bot on one machine. Google uses quite a few Gbot apps on a host of machines. For all we know each one may handle a certain PageRank, or they may even co-exist in a communicable relationship. Either way, URL ordering is not difficult. This would take way too long to type out fully, so let me just refer you to a URL in PM; I don't want to drop another forum URL in the open thread. With the intelligence of your posts that I have seen so far, it should not take a quantum leap for you to figure that question out.
LOL. Give me one of those. I'd like to have a good laugh. Have you written a large-scale SE crawling application dealing with millions of URLs? I haven't, and I personally consider this a very difficult programming/algorithmic task. The actual code for fetching pages is easy stuff. What is difficult is: "Hey, I have 5 million pages in the index plus 1 million discovered but still not crawled URLs. Which URLs should I crawl first?" That's difficult, IMO.

The URL you sent me is from "Apr 10, 2004". Duplicate detection has been part of Google since long before that date. In April 2004, Google might have improved their algorithm, but dup detection is one of the first tasks one should take care of. Let me cite Jon Glick from Yahoo: "You know, if you have a great relevancy algorithm and lousy Spam detection you just get a bad experience for instance."
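Just to illustrate the kind of decision I mean, here is a toy Python sketch of ordering the crawl frontier by a priority score. The scoring formula is entirely invented; "Efficient Crawling Through URL Ordering" discusses real importance metrics (backlink counts, PageRank estimates and so on).

```python
# Toy crawl-frontier ordering. The priority formula is invented for
# illustration only, not taken from any paper or from Google.

import heapq
import time

def priority(pagerank, last_crawl, is_new, url_depth):
    """Higher score = crawl sooner."""
    staleness = (time.time() - last_crawl) / 86400 if last_crawl else 30
    score = pagerank * 10 + staleness          # important and stale pages first
    score += 5 if is_new else 0                # don't starve newly discovered URLs
    score -= url_depth                         # deep URLs (many slashes) wait longer
    return score

class Frontier:
    def __init__(self):
        self._heap = []

    def add(self, url, pagerank=0, last_crawl=None, url_depth=0, is_new=True):
        # heapq is a min-heap, so push the negated score.
        heapq.heappush(self._heap,
                       (-priority(pagerank, last_crawl, is_new, url_depth), url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None
```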
Actually that was the very first instance of anyone reporting widespread duplicate content causing sites to vanish from the index en masse, with proof, on any forum or article; I even checked to make sure. It is already widely known that Google's spam detection really is almost non-existent. The only real times spam is removed, it is not done algorithmically but manually in a spam campaign against a certain type of spam. I am surprised you think they actually have good spam detection. Take for instance the well-known common spam types, hidden text and keyword spamming: nothing is done about these in the algo. Google knows they exist, and seeing as these could be beaten by a bit of code in the spider but are not, it stands to reason Google has decided they are not feasible at runtime for whatever reason, be it resource drain, memory issues, or whatever. We see periodic campaigns to deal with each of these, or sometimes manual dealings when reported.

1 mil+, yes. Billions, no, which is what Google has indexed: 4.2x billion to be precise, according to the numbers on their site. And it is still not a hard thing to do at all. The hard part would be the database itself.

I take it then that you are not a coder. Assume for a minute that the spider in question uses the method of multiple instances for each PR level, and our example bot is doing PR4: the bot simply pulls all the PR4 listings in order and crawls them. This is no difficult task. It is a robot; it does what it is told, nothing more. Assume, before you ask the question, that if it cannot crawl a URL, it simply drops it into a queue for later crawling when it frees up some processing time. Again, not difficult. You are giving spiders more credit than they are due at this date and time. As much as you would like them to be all-knowing and all-seeing, they are not. Wishful thinking is great but it does not help when the real world just doesn't work that way.
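In other words, something about this simple. A Python sketch of my own, just to show the shape of the loop; the helper functions are stand-ins for whatever the database layer and fetcher actually are.

```python
# Sketch of the "one bot instance per PR level" loop described above.
# urls_for_pagerank(), fetch() and store_in_repository() are hypothetical
# stand-ins, not any real crawler's API.

from collections import deque

def run_bot(pr_level, urls_for_pagerank, fetch, store_in_repository):
    """One bot instance: pull every URL at this PR level, crawl in order,
    and defer anything that fails to a retry queue."""
    retry_queue = deque()

    for url in urls_for_pagerank(pr_level):
        try:
            store_in_repository(url, fetch(url))
        except IOError:
            retry_queue.append(url)        # couldn't crawl it now, try later

    # Work through the deferred URLs when there's spare processing time.
    while retry_queue:
        url = retry_queue.popleft()
        try:
            store_in_repository(url, fetch(url))
        except IOError:
            pass                            # give up (or reschedule for next run)
```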
If Google didn't have some sort of dup-doc detection before April, we would have been seeing lots of duplicate content in the results (as was the case with the M$ tech preview). I was talking about using dup-docs to artificially hoard PR as a spam technique. As for hidden text, it is not detected because Google's parser does not use external CSS files. That's a bit strange to me, but they have decided not to take external CSS files into consideration. That's why I've always wondered why people think that H tags are magical: Google does not use external CSS, and it must guess the font sizes of tags.

It depends. If you have thousands of servers and the bandwidth, you can write a simple iterative crawling algorithm. But if the number of pages to crawl is more than you can handle, you have to guess which is better to crawl first. I am actually a coder. I participated in many programming contests in my high-school/university years (I have national/university awards as well as a first place at an international conference of young scientists with a data compression program back in '96). Some of the tasks that I've solved would have been trivial if I had lots of RAM, for example. But sometimes resources are not enough. The web grows very fast and the number of URLs grows quite fast. Are you saying that Google has enough resources to crawl all of them every day? No. That's the main concern. A crawling application has to make smart decisions. It has to work very well with scarce resources and do well on the ever-growing web.

Let me again cite a paper from Google guys: "The design of a good crawler presents many challenges. Externally, the crawler must avoid overloading Web sites or network links as it goes about its business [Kos95]. Internally, the crawler must deal with huge volumes of data. Unless it has unlimited computing resources and unlimited time, it must carefully decide what URLs to scan and in what order. The crawler must also decide how frequently to revisit pages it has already seen, in order to keep its client informed of changes on the Web." (emphasis mine)
You are making it sound like it needs to crawl all 4.2x billion pages daily. It does not; it bases the frequency of crawls mainly on PageRank and the number of incoming links to the page. As to Google having the resources, I know that if I had a 10,000-server farm, I could. Yes, that decision is based on PR and IBLs, not intelligent spiders. I actually thought you might have some background in this. However, based on your answers, I think it is safe to say you probably have no real-world experience with spidering applications.