Surely not - that would mean that if a duplicate document was found, it would be marked never to be crawled again, and the author could never amend it, make it non-duplicate, and then have it re-indexed.
Well then I guess we disagree on bots. IMO bots are not intelligent programs; they are optimized for one task and one task only: go to a page, query it to see if it has been updated since the last spidering, if so read the page and send it to the repository, then go to the next link and repeat the process. Adding a dup content filter to the bots would seriously slow the crawling process, and at any rate you can do the same thing much more efficiently by running the program over the database, where you have much more processing power and better bandwidth.

To suggest that once a page is detected as a duplicate it would never be crawled again is contrary to what search engines are set up to do. Did it ever occur to you that pages change?

One of the features of LocalRank is that it removes all but one page from any one Class C IP address and only ranks the remaining pages. Obviously all internal links and pages are from the same IP address. Agreed, but you are the one who says LocalRank is being used, not me, and if it is being used, then by your logic it must be making a substantial difference to the rankings, but I do not see any such result.

Of course it's not, who ever said it was??? And you are taking this to mean that the duplicate content filter is based solely on the page title and the SERP snippet?? We must be reading totally different patents.

Not that it makes any material difference to the discussion, but a master's degree in engineering and twenty years' experience in the computer field. I am not a programmer, I hire programmers. Insofar as algo background, six years of intensive study of just about everything I can find on the subject. Now my question: why are you so programmer-centric? Are you not aware that there are other skills in the world?
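To show what I mean by a bot doing "one task and one task only", here is a bare-bones Python sketch of that check-then-fetch step. It is purely illustrative; send_to_repository() and last_crawl_time are stand-ins I made up, not anything from Google.

```python
# A minimal sketch of the "check first, fetch only if changed" bot step
# described above, using a conditional HTTP request. Purely illustrative;
# send_to_repository() and last_crawl_time are hypothetical stand-ins.

import urllib.request
import urllib.error
from email.utils import formatdate

def crawl_if_updated(url, last_crawl_time, send_to_repository):
    """Fetch the page only if the server says it changed since the last spidering."""
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since", formatdate(last_crawl_time, usegmt=True))
    try:
        with urllib.request.urlopen(request) as response:
            send_to_repository(url, response.read())   # page changed: store it
    except urllib.error.HTTPError as err:
        if err.code != 304:                             # 304 = Not Modified, skip it
            raise
```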
Mel, I don't know where to start. You mix up the terminology a bit. Crawling is a function of search engines. It is not simply fetching pages, and Google's crawling function is VERY intelligent. Google needs to give a priority to each page, so that low PR pages are crawled once a month, while high PR pages may get crawled a couple of times a day. Read "Efficient Crawling Through URL Ordering". There are other papers on this subject and it is quite important to search engines. If a search engine crawls the wrong pages, it will lead to very slow updates. You can't call crawling unintelligent. By bots you probably mean the URLServers. They are just a small part of crawling.

After a page is fetched, fingertips (as Google calls them) are generated and compared to the fingertip database. If a document is found to be a duplicate of another one, its affiliate or groupID or whatever you call it is set the same as the other document. This is done by a union-find operation. If two docs have the same groupID, then they are duplicates. At some point later, you may separately run a duplicate site detection application by checking the groupIDs and the linking structure.

I didn't say it would never be crawled again. It is not ordinary to start a site with duplicate pages and only afterwards develop/change the content. That's plain stupid. Google has the full right to ignore such pages and use its resources for unique content. If a webmaster makes this kind of stupidity, he will simply need to change the page content and name, and link to it from a page that's not labeled duplicate. Of course dup-doc detection is part of the crawling facility. As I said earlier, read the damn papers, look at the images etc.

Internal links don't count in LocalRank, but they DO count in the OldScore. LocalRank is meant to increase the relevancy of pages that have non-affiliated links from other highly ranked pages. How do you know that? As I said, read the goddamn patents. There are two types of dup content: 1. Duplicate and near-dup documents, where two WHOLE docs are dups. 2. QUERY-SPECIFIC dup documents. Two docs may be quite different, but if they have three paragraphs in common, and the query keywords are found there, Google may decide that they are query-specific dups. There's a separate patent on query-specific dup detection. It's part of the ranking function.

I am not programmer-centric. Programming is just one of the things you need to become an SEO expert. You need substantial experience solving complex algorithmic problems. I know it. I have participated in loads of programming contests and I have seen great programmers solving nothing, because they don't know algorithms. Programming is a skill to code. Algorithms are another animal. If you work at Google, you have to have it all. If you invent some great way to increase the relevancy and it turns out that it needs two minutes to execute a query, you have simply wasted time. I have had my ass whooped more than once in competition and I have the utmost respect for guys who have the background to do it. In competitions, you are given 5 to 7 tasks, a time limit of let's say 1 second, and every problem has just one efficient way to be solved. If you don't know that algorithm, you may be the best programmer in the world, but you can't make it run under 1 second. You get 0 points. The best high-school and university programmers and algorithmically trained people participate in competitions, and a lot of the time more than 50% of all teams solve 0 problems.
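For anyone curious what I mean by union-find grouping, here is a rough Python sketch in my own naming, not anything from Google's papers: documents that share a fingerprint get merged into the same group, and two docs with the same group representative are treated as duplicates.

```python
# Minimal union-find sketch for grouping duplicate documents.
# All names and structures here are illustrative assumptions,
# not Google's actual implementation.

class DupGroups:
    def __init__(self):
        self.parent = {}          # docID -> parent docID
        self.fingerprints = {}    # fingerprint -> first docID that produced it

    def find(self, doc_id):
        """Find the group representative (with path compression)."""
        self.parent.setdefault(doc_id, doc_id)
        if self.parent[doc_id] != doc_id:
            self.parent[doc_id] = self.find(self.parent[doc_id])
        return self.parent[doc_id]

    def union(self, a, b):
        """Merge the groups of two documents."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def add_document(self, doc_id, fingerprint):
        """If the fingerprint was seen before, join that doc's group."""
        if fingerprint in self.fingerprints:
            self.union(self.fingerprints[fingerprint], doc_id)
        else:
            self.fingerprints[fingerprint] = doc_id

    def same_group(self, a, b):
        """Two docs with the same groupID are considered duplicates."""
        return self.find(a) == self.find(b)
```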
I don't understand the proof of maybe 50% of all the more complicated algorithms that I've used at competitions, but I can code them correctly. Now, if some idiot comes to me and tells me he is not a programmer but knows how to solve such tasks, I tell him he is full of it.

Neither you nor I have the needed background to speak authoritatively on what Google does. I have a lot of algo/programming experience, but I lack the specific information SE engineers have. They have read, tried, experimented, coded etc. for years. I have only read the public literature. If I make a great software program, and some self-proclaimed expert who has never programmed, let alone knows anything about algos, starts yapping around explaining to other people how my program works, I'll either be ROFLing or appalled. I can only imagine how Google guys and gals laugh at the pathetic efforts of SEO experts to explain Google. That's why I am very careful about what I say.

I asked you about your background because a programmer would never use your language and terms to try to explain certain things in Google. It is just obvious. I am in no way trying to insult anyone. I am just saying that we shouldn't talk too much about things we don't know and attach "expert", "guru" or whatever nonsense to ourselves. We are all Search Engine Amateurs and we are chatting here about search engines.
Yes, I've read the papers, but we seem to read them differently. When I say bots I mean bots, and when I say URL servers I mean URL servers. While it is true that the URL servers have mechanisms for prioritizing the crawling order, this does not make the bots intelligent; they are a separate function/program. They are called fingerprints, not fingertips, by the way. May I suggest that you reread them?

Here is a quote from Google's Patent number 6,659,423 on duplicate page detection (Specifications Page 21): The document first lists the five primary methods in which the invention may be used and then states: From this it is apparent to me that this is not a system primarily proposed for use during crawling, although that is a possible use of it, and it certainly presents no proof that the patent is used by bots at the time they spider the pages, as you seem to imply.

Yes, I think that's what I said, but I do not think it increases the relevancy of the pages, but perhaps the results. If you set the variables in LocalRank you can have either OldScore or LocalRank as more or less important, and as you say there would be no logical reason to implement LocalRank if it were not going to be an important part of the ranking system; therefore I ask again, what is your evidence that LocalRank is in fact in use?

Fine, you have your ideas of what an expert is and I have mine, but I can assure you that simply being a programmer will not make you a successful SEO, and that is a subject in which I believe I have enough experience to discuss with authority. No one said we were search engine experts, or search engine amateurs; we are not trying to build a search engine, we are discussing SEO here, but just having programming ability does not make anyone a search engine expert. There are hundreds of search engines coded every year; some never see the light of day, some do but die after a few months, and some may succeed, but IMO those that do succeed will have that rare combination of a vision of what a search engine should do, mathematical skills, programming skills, and the business acumen to make it viable.

At any rate this discussion has yielded nothing new or any evidence that LocalRank is in fact in use, etc. etc., and thus there are better uses for large blocks of time such as this.
Let me cite the original paper: the 3 major applications are crawling, indexing and searching. Where do you think the duplicate-doc detection is? Even if the detection information is used later, the ACTUAL detection is part of the crawling application. And the crawling application IS intelligent. The query-specific doc detection is in the searching application. Here's a link to the second patent: Detecting query-specific duplicate documents.

I've never said that. I just said that programming/algos is a prerequisite to being an SEO expert, and lacking either of the two automatically means you can't understand SEs.

I have evidence. I have monitored the "diet software" and "fitness software" keyphrases for a long enough period of time, and LocalRank IMO is used. Example: the only site that does not have a DMOZ listing is mine, and although I have much better SEO than my competitors, a lot of them outrank me because they have a link from DMOZ from a page that already ranks high for fitness software. Example 2: if LocalRank is not used, then the top results will be the so-called authority sites or DMOZ and similar directories. But because these auth. sites link to the actual competitors, they push them up with LocalRank, so the user doesn't get a bunch of directory listings, but the actual sites. That's IMO one of the purposes of LocalRank. Example 3: the #1 ranked site for Search Engine Optimization has links from many other highly ranked pages for the term. If LocalRank is used, they'll get the highest LocalRank score.
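To make that concrete, here is a rough Python sketch of the idea as I read it from the LocalRank patent: re-rank the top results by counting non-affiliated inbound links from within the same result set. The data structures, the affiliation test and the 50/50 weighting are my own simplifications, not anything confirmed about Google's implementation.

```python
# Toy LocalRank-style re-ranking sketch. Assumptions: we already have
# an initial ranking (OldScore), a link graph among the top-N results,
# and a test for whether two docs are "affiliated" (same owner/IP block).
# None of this reflects Google's real code; it's just the patent's idea.

def local_rank(top_results, links, affiliated):
    """
    top_results: list of (doc, old_score), already sorted by old_score.
    links:       set of (source_doc, target_doc) pairs among these results.
    affiliated:  function(doc_a, doc_b) -> True if the two docs are affiliated.
    Returns the results re-sorted by a combined score.
    """
    local_score = {doc: 0 for doc, _ in top_results}

    # Count links from non-affiliated pages within the top results only.
    for src, dst in links:
        if src in local_score and dst in local_score:
            if src != dst and not affiliated(src, dst):
                local_score[dst] += 1

    # Combine OldScore and LocalScore; the weights are arbitrary here,
    # the patent leaves them as tunable parameters.
    combined = [(doc, 0.5 * old_score + 0.5 * local_score[doc])
                for doc, old_score in top_results]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)
```

The point of example 2 above is visible in the sketch: directory pages give out LocalScore to the sites they link to, so the actual sites rise above the directories themselves.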
I see a statement but nothing new to back it up. While they choose to discuss the 3 major applications, there could be, and are in fact, hundreds of other applications in use. And what does that have to do with crawling or our discussion? Well, you are entitled to your opinion and I to mine. I am not a programmer and I seem to be able to discuss this topic with you, but to be brutally frank the word that comes to mind is bullshit. If this is the best evidence you have then it's back to the drawing board as far as I'm concerned. I do not agree with either of your conclusions, nor do I see anything that could not be caused by nine or ten other things.
LOL. Let's call them thousands. Maybe we should reinvent the SE field. I guess we were also discussing dup-doc detection, the two types of dup-docs etc. Never mind. The word here is "seem". It's OK with me that you call me a bullshitter. You have no idea what I have observed for these queries, but you made a conclusion based on just one sentence. That's OK with me. Have a nice day.
Just a simple question: suppose I have a page of 1000 words and I copy a few sentences onto a different page, would that be seen as duplicate content? Or suppose I have a PDF document of a few hundred pages and copy a few pages onto another page, would that be duplicate content to the spiders?
I'm sure that a few sentences from a larger article would not be considered bad duplicate content; otherwise, how could any web page offer "quotes" from other articles on the web?
In this situation you have different documents. They aren't duplicate. If someone is searching for keywords that are used in the few duplicate sentences, Google may filter one of the pages from the SERPs for the given query.
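A crude way to picture that query-specific case, purely my guess at the mechanism rather than anything from the patent text: take the snippet region that matches the query from each candidate result and compare them; if two regions are near-identical, keep only one. A Python sketch:

```python
# Toy sketch of query-specific duplicate filtering. Assumption: for each
# result we can extract the text window around the query terms (the part
# a snippet would be built from), and we drop results whose windows are
# near-identical to one we've already kept. Hypothetical names throughout.

import re

def query_window(text, query_terms, radius=150):
    """Return the chunk of text around the first query-term hit."""
    lowered = text.lower()
    for term in query_terms:
        pos = lowered.find(term.lower())
        if pos != -1:
            return lowered[max(0, pos - radius): pos + radius]
    return lowered[:2 * radius]

def near_identical(a, b, threshold=0.9):
    """Compare two windows by word overlap (Jaccard similarity)."""
    wa, wb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= threshold

def filter_query_dups(results, query_terms):
    """results: list of (url, page_text). Keep one result per duplicate window."""
    kept, windows = [], []
    for url, text in results:
        window = query_window(text, query_terms)
        if any(near_identical(window, seen) for seen in windows):
            continue  # filtered from the SERP for this query only
        kept.append(url)
        windows.append(window)
    return kept
```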
I won't go down the ethics path. You own the site/s; you do whatever the hell you want to with them. End of story. If the search engines make money indirectly off the content they cache, good for them. As for penalties, well, that really makes the point, doesn't it? If you're good at what you do, you won't be penalized. At the end of the day, and as G well knows, it comes down to scale of ownership. Whether it's 100 webmasters all looking to SEO their individual sites, or one webmaster looking to SEO 100 sites, it's still a network. The approach just changes, that's all. If I'm giving all parties what they want (the bots, the audience, the stakeholders, and my banker), then I'm happy. Now all I gotta do is figure out this network management thingy.
And the search engines have the right to index you or drop you depending on if they like what you do. I guess we define networks differently, and BTW Google does care if you use a network of sites to get your rankings and has removed many sites from the rankings for that. Lots of luck.
Errr, sorry, but I have to say something to this. Having a simple hash in a database field for a URL with the page's PR/last time of crawl is simple, and checking against that each crawl is even simpler as a way to determine what pages to crawl and what pages not to crawl. This is not how Google does it at all; however, to say that the spider needs to be intelligent is a joke. It does not; its coders may be, but the damn spider is still just a program.

As for the duplicate filter, we were the first to report its existence with documentation, and I have not found anyone else that has tracked its usage at all, except to say, "Oh my, I was hit by it." The dupe content filter is exactly that: a filter which is run at periodic times across the index. Nothing more, and certainly not a core agent. If you read the method of fingerprinting dupe pages a little harder you would be able to see the patent is not made for an ongoing process.
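Something like this is what I mean by "simple": a bare-bones Python sketch of the lookup-then-decide step. The table schema, field names and PR-based revisit intervals are all made up for illustration, not anything from Google.

```python
# Bare-bones "should we crawl this URL?" check of the kind described above.
# The schema and the PR-based revisit intervals are made-up examples.
# Example setup: db = sqlite3.connect("crawl.db")

import sqlite3
import time

REVISIT_SECONDS = {          # higher PR -> crawl more often (arbitrary numbers)
    0: 30 * 86400, 1: 30 * 86400, 2: 14 * 86400, 3: 7 * 86400, 4: 3 * 86400,
    5: 86400, 6: 86400, 7: 43200, 8: 43200, 9: 21600, 10: 21600,
}

def should_crawl(db, url):
    """Look up the URL's stored PR and last crawl time and decide."""
    row = db.execute(
        "SELECT pagerank, last_crawl FROM urls WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return True                       # never seen before: crawl it
    pagerank, last_crawl = row
    interval = REVISIT_SECONDS.get(int(pagerank), 30 * 86400)
    return time.time() - last_crawl >= interval

def mark_crawled(db, url, content_hash):
    """Record the crawl time and a hash of the fetched content."""
    db.execute(
        "UPDATE urls SET last_crawl = ?, content_hash = ? WHERE url = ?",
        (time.time(), content_hash, url),
    )
    db.commit()
```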
If a software program, let's call it "Product X", implements an algorithm that does something intelligently, you can say that "Product X" is intelligent. I wouldn't call URL ordering in crawling a simple application. If it were simple, no one would be publishing papers on it. The more the web grows, the more important crawling becomes. Your description lacks at least a couple of factors: what if a server is down? What about a page's age, the number of slashes in the URL, new pages vs. old pages, etc.? You'll also have to make the ordering such that the crawling doesn't get stuck crawling only high PR pages and ignoring new/low PR ones. There are many details. You can't say it is simple.

First of what/who? How do you track the usage of a dup filter?

You are right. The way the exemplary implementations are described, it would suggest that it is periodically run. It can also be made a constantly ongoing process. It all depends on how Google implemented it. There are pros and cons to both approaches. Let's take a look at them:

1) Periodically run application
The pros: code separation and easier programming; less storage space (removing unique fingerprints).
The cons: if you remove fingerprints, you'd have to generate them again on the next pass, which requires more CPU power. What about newly found pages? They'd have to wait until the next dup-doc pass, so Google would either have to risk showing dup-docs or just wait until the next pass. That would slow down the time a page needs to get into the index: it would have to wait for the next dup-doc pass, then get parsed, indexed, sorted and put in the inverted index. That is not a bad feature if you want to delay new pages from getting ranked quickly. Also, you can pair the dup-doc pass with PR recalculations and maybe other stuff before major updates. A periodically run application on the whole repository will probably take days, and it would be done once every couple of weeks.

2) Constantly ongoing process
The pros: dup documents would be detected on the fly and could get indexed more quickly.
The cons: modification of currently used and well-tested code; more data storage.

Which of the two does Google use? I don't know. Maybe you are right. Maybe it is a periodically run application, although I wouldn't bet 100% on it. It is not the patent, it is Google that decides in the end. If it is periodically run, we can expect slower and slower ranking times for new pages as the web grows.
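For clarity, here is roughly what I have in mind for option 1, sketched in Python. fetch of the repository, the shingling fingerprints and the groups object are all hypothetical stand-ins; the only point is to show where the CPU cost of regenerating fingerprints on every pass comes from.

```python
# Sketch of a periodically run dup-doc pass over the whole repository.
# The repository iterable and the union-find "groups" object are
# hypothetical stand-ins, not Google's implementation.

def shingle_fingerprints(text, k=8):
    """Hash every k-word window of the document (a common dup-detection trick)."""
    words = text.split()
    return {hash(" ".join(words[i:i + k])) for i in range(max(1, len(words) - k + 1))}

def periodic_dup_pass(repository, groups, overlap_threshold=0.9):
    """
    repository: iterable of (doc_id, text) covering the whole index.
    groups:     a union-find structure with a union(doc_a, doc_b) operation.
    """
    seen = {}  # fingerprint -> first doc_id that produced it
    for doc_id, text in repository:
        prints = shingle_fingerprints(text)           # regenerated every pass (CPU cost)
        hits = [seen[fp] for fp in prints if fp in seen]
        if prints and len(hits) / len(prints) >= overlap_threshold:
            groups.union(hits[0], doc_id)             # same groupID => duplicates
        for fp in prints:
            seen.setdefault(fp, doc_id)
```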
People have published papers on the proper use of <b> tags. I have written more than my share of spidering applications; I can definitely say it is not difficult. I think in your argument you forget that Gbot is not actually only one bot on one machine. Google uses quite a few Gbot apps on a host of machines. For all we know each one may handle a certain PageRank, or they may even co-exist in a communicable relationship. Either way, URL ordering is not difficult. This would take way too long to type out fully, so let me just refer you to a URL in PM; I don't want to drop another forum URL in the open thread. With the intelligence of your posts that I have seen so far, it should not take a quantum leap for you to figure that question out.
LOL. Give me one of those. I'd like to have a good laugh. Have you written a large-scale SE crawling application dealing with millions of URLs? I haven't, and I personally consider this a very difficult programming/algorithmic task. The actual code for fetching pages is easy stuff. What is difficult is: "Hey, I have 5 million pages in the index plus 1 million discovered but still not crawled URLs. Which URLs should I crawl first?" That's difficult, IMO.

The URL you sent me is from "Apr 10, 2004". Duplicate detection has been part of Google since long before that date. In April 2004, Google might have improved their algorithm, but dup detection is one of the first tasks one should take care of. Let me cite Jon Glick from Yahoo: "You know, if you have a great relevancy algorithm and lousy Spam detection you just get a bad experience for instance."
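Just to illustrate the kind of decision I mean, here is a toy Python sketch of ordering the crawl frontier by a priority score. The scoring formula is entirely invented; "Efficient Crawling Through URL Ordering" discusses real importance metrics (backlink counts, PageRank estimates and so on).

```python
# Toy crawl-frontier ordering. The priority formula is invented for
# illustration only, not taken from any paper or from Google.

import heapq
import time

def priority(pagerank, last_crawl, is_new, url_depth):
    """Higher score = crawl sooner."""
    staleness = (time.time() - last_crawl) / 86400 if last_crawl else 30
    score = pagerank * 10 + staleness          # important and stale pages first
    score += 5 if is_new else 0                # don't starve newly discovered URLs
    score -= url_depth                         # deep URLs (many slashes) wait longer
    return score

class Frontier:
    def __init__(self):
        self._heap = []

    def add(self, url, pagerank=0, last_crawl=None, url_depth=0, is_new=True):
        # heapq is a min-heap, so push the negated score.
        heapq.heappush(self._heap,
                       (-priority(pagerank, last_crawl, is_new, url_depth), url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None
```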
Actually that was the very first instance of anyone reporting widespread duplicate content causing sites to vanish from the index en masse, with proof, on any forum or article; I even checked to make sure. It is already widely known that Google's spam detection really is almost non-existent. The only real times spam is removed, it is not done algorithmically but manually in a spam campaign against a certain type of spam. I am surprised you think they actually have good spam detection. Take for instance the well-known common spam types, hidden text and keyword spamming: nothing is done about these in the algo. Google knows they exist, and seeing as these could be beaten by a bit of code in the spider but are not, it stands to reason Google has decided they are not feasible at runtime for whatever reason, be it resource drain, memory issues, or whatever. We see periodic campaigns to deal with each of these, or sometimes manual dealings when reported.

1 mil+, yes. Billions, no, which is what Google has indexed: 4.2x billion to be precise, according to the numbers on their site. And it is still not a hard thing to do at all. The hard part would be the database itself.

I take it then that you are not a coder. Assume for a minute that the spider in question uses the method of multiple instances for each PR level, and our example bot is doing PR4: the bot simply pulls all the PR4 listings in order and crawls them. This is no difficult task. It is a robot; it does what it is told, nothing more. Assume, before you ask the question, that if it cannot crawl a URL, it simply drops it into a queue for later crawling when it frees up some processing time. Again, not difficult. You are giving spiders more credit than they are due at this date and time. As much as you would like them to be all-knowing and all-seeing, they are not. Wishful thinking is great but it does not help when the real world just doesn't work that way.
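In other words, something about this simple. A Python sketch of my own, just to show the shape of the loop; the helper functions are stand-ins for whatever the database layer and fetcher actually are.

```python
# Sketch of the "one bot instance per PR level" loop described above.
# urls_for_pagerank(), fetch() and store_in_repository() are hypothetical
# stand-ins, not any real crawler's API.

from collections import deque

def run_bot(pr_level, urls_for_pagerank, fetch, store_in_repository):
    """One bot instance: pull every URL at this PR level, crawl in order,
    and defer anything that fails to a retry queue."""
    retry_queue = deque()

    for url in urls_for_pagerank(pr_level):
        try:
            store_in_repository(url, fetch(url))
        except IOError:
            retry_queue.append(url)        # couldn't crawl it now, try later

    # Work through the deferred URLs when there's spare processing time.
    while retry_queue:
        url = retry_queue.popleft()
        try:
            store_in_repository(url, fetch(url))
        except IOError:
            pass                            # give up (or reschedule for next run)
```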
If Google didn't have some sort of dup-doc detection before April, we would have been seeing lots of duplicate content in the results (as was the case with the M$ tech preview). I was talking about using dup-docs to artificially hoard PR as a spam technique. As for hidden text, it is not detected because Google's parser does not use external CSS files. That's a bit strange to me, but they have decided not to take external CSS files into consideration. That's why I've always wondered why people think that H tags are magical: Google does not use external CSS, and it must guess the font sizes of tags.

It depends. If you have thousands of servers and the bandwidth, you can write a simple iterative crawling algorithm. But if the number of pages to crawl is more than you can handle, you have to guess which is better to crawl first. I am actually a coder. I participated in many programming contests in my high-school/university years (I have national/university awards as well as a first place at an international conference of young scientists with a data compression program back in '96). Some of the tasks that I've solved would have been trivial if I had lots of RAM, for example. But sometimes resources are not enough. The web grows very fast and the number of URLs grows quite fast. Are you saying that Google has enough resources to crawl all of them every day? No. That's the main concern. A crawling application has to make smart decisions. It has to work very well with scarce resources and do well on the ever-growing web.

Let me again cite a paper from Google guys: "The design of a good crawler presents many challenges. Externally, the crawler must avoid overloading Web sites or network links as it goes about its business [Kos95]. Internally, the crawler must deal with huge volumes of data. Unless it has unlimited computing resources and unlimited time, it must carefully decide what URLs to scan and in what order. The crawler must also decide how frequently to revisit pages it has already seen, in order to keep its client informed of changes on the Web." (emphasis mine)
You are making it sound like it needs to crawl all 4.2x billion pages daily. It does not; it bases the frequency of crawls mainly on PageRank and the number of incoming links to the page. As to Google having the resources, I know that if I had a 10,000-server farm, I could. Yes, that decision is based on PR and IBLs, not intelligent spiders. I actually thought you might have some background in this. However, based on your answers, I think it is safe to say you probably have no real-world experience with spidering applications.