Is the extracted DMOZ dataset (~4.5M webpages) available anywhere? I know I could use some software to do the extraction myself, but someone must have done this before. Any idea where I can get it? Even a list of terms extracted from each URL would be sufficient. I'd appreciate any help.
http://rdf.dmoz.org/
Keep in mind that your DMOZ clone will be considered duplicate content in the eyes of most search engines. Original content is king: start your own directory and work like hell. Or better, find a less saturated niche and work like hell. Your rewards will be self-evident.
I know that one. Those datasets contain only the URLs, not the actual contents of the pages. I want the contents of the URLs (i.e., already crawled). Is that available anywhere?
That goes beyond the scope of ANY directory, if I am following you correctly. What would you do with all that junk?
The extracted pages could form a document collection on which web queries can be experimentally evaluated across different retrieval tasks. The queries are available; the document collection, i.e. the extracted webpages, is not.
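If no pre-crawled collection turns up, you can build one yourself from the RDF dump. Here's a minimal sketch in Python using only the standard library: it scans the dump for external URLs, fetches each page, and strips the HTML down to a crude term list. The filename content.rdf.u8.gz and the <link r:resource="..."> element layout are assumptions based on the dump's historical format; check them against whatever version you download. A real crawl would also need politeness delays, robots.txt handling, and retries.

```python
import gzip
import re
import urllib.request
from html.parser import HTMLParser

# Assumed element layout of the DMOZ content dump; verify against your copy.
LINK_RE = re.compile(r'<link r:resource="([^"]+)"')

def iter_urls(dump_path):
    """Yield external page URLs from the DMOZ content RDF dump,
    scanning line by line to avoid loading the multi-GB file at once."""
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINK_RE.search(line)
            if m:
                yield m.group(1)

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def fetch_terms(url, timeout=10):
    """Download one page and return its lowercased word tokens."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    return re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())

if __name__ == "__main__":
    # Demo: extract term lists for the first few URLs in the dump.
    for i, url in enumerate(iter_urls("content.rdf.u8.gz")):
        if i >= 5:
            break
        try:
            print(url, fetch_terms(url)[:20])
        except Exception as exc:
            print(url, "failed:", exc)
```

At 4.5M pages you would want to parallelize the fetching and persist the results (e.g. one compressed file per batch), but the URL extraction and term stripping above are the core of it.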