Well, isn't using PageRank for ordering the URLs intelligent, smart, clever, or whatever you call it? Can you invent a smarter algorithm that would work well for a large-scale search engine? If it's so elementary, you might apply for a job at one of the leading search engines.

I have no programming experience with spidering applications, and of course I've never claimed that I have. I do have enough experience with algorithmically tough applications. I have been on projects with some of the best Bulgarian programmers, including national programming champions, and I have learned to respect capable programmers. That's one of the main themes of my postings: respect for the people at Google and the other engines. People are writing about one of the finest pieces of software on the planet as if it were something elementary to put together. People with zero programming experience (let alone algorithms experience) are claiming to be experts and to know what Google does. It's just ridiculous.

Let me finish with a passage from the original paper: "Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up."
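To make the quoted point concrete, here is a minimal sketch (in Python, purely my own illustration, not anything from the paper) of the kind of defensive fetching a crawler has to do: every call can fail in strange ways on strange hosts, and none of those failures is allowed to stop the crawl.

import time
import urllib.error
import urllib.request

def fetch(url, retries=2, timeout=10):
    # Fetch one page, treating every failure as survivable. A crawler
    # touching millions of hosts will hit timeouts, malformed responses,
    # and servers that simply misbehave, so nothing here is allowed to
    # raise past this function.
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError, ValueError) as err:
            # Log and back off; one strange host must not kill the crawl.
            print("fetch failed for %s (%r), attempt %d" % (url, err, attempt + 1))
            time.sleep(2 ** attempt)
    return None  # give up on this URL and move on to the rest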
You missed my point completely. The spider did not just decide to use PageRank for URL ordering; its creators did. They are smart, yes, as I stated well above; the spider is not. I agree these are among the best-written pieces of software out there, and I never claimed they were simple. I merely said that the basics are not difficult to understand for any coder with some experience. But posting things that make the software look as intelligent as an AI just adds to the confusion and misconceptions.
The question in my mind is: does the spider decide where it wants to go next, or does it just take its list of sequential URLs to be crawled from the URL server?
My guess would be that the best way to handle it is to keep pulling URLs from the URL server and to feed new URLs back into the queue as it finds them.
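For what it's worth, here's a rough sketch of that idea in Python. The URLServer class, the scores, and the helpers fetch, extract_links, and estimate_score are all made up for illustration; the only point is the shape of the loop: the spider pulls a batch of the highest-priority URLs, crawls them, and sends any newly discovered links back to the server's queue. The priority stands in for a PageRank-style importance estimate; the real ordering is only hinted at in the paper, so treat this as a sketch of the idea, not the actual design.

import heapq
from urllib.parse import urljoin

class URLServer:
    # Toy URL server: hands out the highest-scoring unseen URLs first.
    def __init__(self):
        self._heap = []      # stores (-score, url) so heapq pops the highest score
        self._seen = set()

    def add(self, url, score=0.0):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def next_batch(self, n=100):
        batch = []
        while self._heap and len(batch) < n:
            _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch

def crawl(server, fetch, extract_links, estimate_score, max_pages=1000):
    # Spider loop: drag batches of URLs from the URL server, crawl them,
    # and send newly found links back with an estimated score.
    crawled = 0
    while crawled < max_pages:
        batch = server.next_batch()
        if not batch:
            break
        for url in batch:
            page = fetch(url)
            if page is None:
                continue
            crawled += 1
            for link in extract_links(url, page):
                server.add(urljoin(url, link), estimate_score(link))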
About duplicate content and Google: I wish Google would do something about the duplicate content in the search results. Do a Google search for "new york real estate law" and you will see that on the first 5 pages about half the sites are identical, just with different domain names. All the titles say "Real Estate 8". Do a search on "Real Estate 8" and Google returns 5,110 results. Somebody definitely found a way to exploit Google. They are all link directories that point to another link page, and so on. The sites are just traffic grabbers, yet Google is spitting out tons of them in the search results.
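I have no idea what Google actually does here, but just to show how the exact copies could be collapsed, here is a rough Python sketch that fingerprints the normalized text of each result and keeps only the first page per fingerprint. A real engine would need something fuzzier (shingling or similar) for near-duplicates; this only catches pages that are identical apart from the domain name.

import hashlib
import re

def content_fingerprint(html):
    # Strip tags, collapse whitespace, lowercase, then hash. Exact copies
    # served under different domain names all map to the same fingerprint.
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def dedupe(results):
    # results is a list of (url, html) pairs; keep the first URL seen
    # for each fingerprint when assembling the result page.
    seen = set()
    unique = []
    for url, html in results:
        fp = content_fingerprint(html)
        if fp not in seen:
            seen.add(fp)
            unique.append(url)
    return unique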
I have to agree and disagree with nohaber on a few things, though. "Expertise" can be a very relative and highly overused term. However, not everyone who got into the game early wants to wind up working for a major corporation; some would rather build something for themselves. I know a lot of people I consider experts. But the majority of those calling themselves that ... no, you're right, they are not.
When you call something old, you need to differentiate between outdated and distinguished. This is more the latter.