Well... there are millions and billions of sites, with more added every day, so it's really hard to check for duplicate content manually or by hiring employees. In that case, filters and spiders are a good option, and no doubt Google is doing the same: using filters and spiders to check for duplicate content.
Indeed... Copyscape is really a good tool for checking duplication. I often use it to make sure nobody has copied my articles.
How would a spider know which one is the original content and which was copied? I also think that checking the first 10 results for major keywords for spam could be done manually by Google employees.
The spider does not know. It does not have to. The software that acts on the information the spider collects does. One view is http://searchengineoptimization.ell... on how a search engine determines duplicate content. There is also archive.org, although I am sure Google has its own better INTERNAL version of archive.org anyway, to check how long a link has been on a page, how often content changes and MANY other things. Yes, they could do that manually, but a computer program works quicker, costs less, does not have to go on vacation, etc. Humans will be LAST in the chain, brought in only if all the checks by the computer program still show something iffy; otherwise it would cost too much.
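Just to illustrate the kind of filter such software could run (this is not Google's actual algorithm), here is a minimal sketch of shingle-based near-duplicate detection: split each page's text into overlapping word n-grams, hash them, and compare two pages by the overlap of their shingle sets. The shingle size of 4 words and the 0.5 threshold are arbitrary choices for the example.

```python
# Minimal sketch of shingle-based near-duplicate detection.
# Illustration only, not Google's actual filter; the shingle size
# and the 0.5 threshold are arbitrary assumptions.

import re


def shingles(text, size=4):
    """Return the set of hashed overlapping word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    return {hash(" ".join(words[i:i + size]))
            for i in range(max(len(words) - size + 1, 1))}


def similarity(text_a, text_b, size=4):
    """Jaccard similarity between the shingle sets of two documents."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "Search engines use spiders to crawl pages and filters to flag duplicates."
    rewrite = "Search engines use spiders to crawl pages and filters to detect duplicates."
    score = similarity(original, rewrite)
    print(f"similarity: {score:.2f}")
    print("likely duplicate" if score > 0.5 else "probably distinct")
```

The point of shingling over a plain hash of the whole page is that it still catches copies with small edits, which is exactly the case a program can churn through millions of pages for, leaving humans only the iffy leftovers.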