Hey all, I'm looking to develop plagiarism-checking software. Can anyone give me a basic idea of how it could work? Besides that, can you recommend any online website for checking plagiarism?
First, I would scrape the original site's content (strip the tags), then take random snippets from the content, 8-12 words or so each, maybe 3 or 4 of them. Then do an exact-phrase search (the snippet in quotes) on Google/Yahoo/Bing for each one. If matches come back, it is probably stolen content.
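Here is a rough sketch of the snippet step described above, in Python. The function names (strip_tags, sample_snippets, exact_queries) are made up for illustration, the regex tag stripping is crude (a real HTML parser would be safer), and it only builds the quoted query strings; actually sending them to a search engine is left out.

```python
import random
import re
from html import unescape


def strip_tags(html):
    """Crude tag stripper: drop script/style bodies, then every remaining tag."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return re.sub(r"\s+", " ", unescape(text)).strip()


def sample_snippets(text, count=4, min_words=8, max_words=12, rng=None):
    """Pick `count` random runs of consecutive words, each 8-12 words by default."""
    rng = rng or random.Random()
    words = text.split()
    if not words:
        return []
    snippets = []
    for _ in range(count):
        # Never ask for more words than the text contains.
        length = min(rng.randint(min_words, max_words), len(words))
        start = rng.randint(0, len(words) - length)
        snippets.append(" ".join(words[start:start + length]))
    return snippets


def exact_queries(snippets):
    """Wrap each snippet in double quotes so it is searched as an exact phrase."""
    return ['"' + s.replace('"', "") + '"' for s in snippets]
```

You would call it like exact_queries(sample_snippets(strip_tags(page_html))) and then paste or send each query to the search engine of your choice.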
Yeah, you need to depend on search engines for the result. Search for the given string in all the major search engines; if any one of them shows a result, the article is probably stolen, otherwise it is probably unique. But this depends entirely on whether the search engine has indexed the article.
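The "any one engine matches" rule could be sketched like this. The search functions are placeholders you would have to write yourself (each one is assumed to take a query string and return a hit count); nothing here is a real search engine API.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shape of a search backend: query string in, number of hits out.
SearchFn = Callable[[str], int]


def looks_copied(snippet: str, engines: Dict[str, SearchFn]) -> Tuple[bool, List[str]]:
    """Flag the snippet if at least one engine returns a hit for the exact phrase."""
    query = '"' + snippet + '"'
    matched = [name for name, search in engines.items() if search(query) > 0]
    return bool(matched), matched
```

Keep in mind that no hits only means the engines have not indexed a copy, not that the text is definitely unique.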
Thanks a lot. Yes, Copyscape is the best free website for this, and there is plenty of other free software. But don't you think that if we break a whole document of, let's say, 1000 words into 100 queries of 10 words each, and do this for 20 articles, that adds up to 2000 queries sent to Google or another search engine? What's the chance of getting blocked, since Google is highly sensitive in this regard?
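To put numbers on that, here is the arithmetic as a tiny Python sketch. The function name query_count is made up for illustration, and the comparison figure assumes the "3 or 4 random snippets per article" approach suggested earlier in the thread.

```python
import math


def query_count(words_per_article, words_per_query, articles):
    """Queries needed if every article is split into consecutive fixed-size chunks."""
    per_article = math.ceil(words_per_article / words_per_query)
    return per_article * articles


full_scan = query_count(1000, 10, 20)  # 100 queries per article * 20 articles = 2000
sampled = 4 * 20                       # 4 random snippets per article = 80
print(full_scan, sampled)              # prints: 2000 80
```

So sampling a handful of snippets per article cuts the volume by a factor of 25 compared to checking every 10-word chunk, at the cost of possibly missing partially copied text.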
Simple: just use different proxy servers and emulate different web browsers (a new browser and proxy for each query)...