Is the extracted DMOZ dataset (~4.5M webpages) available anywhere? I know I could use some software to do the extraction myself, but someone must have done this before. Any idea where I can get it? Even a list of terms extracted from each URL would be sufficient. I'd appreciate any help.
http://rdf.dmoz.org/
Keep in mind that your DMOZ clone will be considered duplicate content in the eyes of most search engines. Original content is king: start your own directory and work like hell. Or better, find a less saturated niche and work like hell. Your rewards will be self-evident.
I know that one. Those datasets contain only the URLs, not the actual contents of the pages. I want the contents of the URLs (i.e., already crawled). Is that available anywhere?
That goes beyond the scope of ANY directory, if I am following you correctly. What would you do with all that junk?
The extracted pages could form a document collection on which web queries can be experimentally evaluated across different retrieval tasks. The queries are available; the document collection, i.e. the extracted webpages, is not.
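If no pre-crawled collection turns up, you can build one yourself from the RDF dump. Here's a minimal sketch in Python using only the standard library: it scans the dump for external URLs, fetches each page, and strips the HTML down to a crude term list. The filename content.rdf.u8.gz and the <link r:resource="..."> element layout are assumptions based on the dump's historical format; check them against whatever version you download. A real crawl would also need politeness delays, robots.txt handling, and retries.

```python
import gzip
import re
import urllib.request
from html.parser import HTMLParser

# Assumed element layout of the DMOZ content dump; verify against your copy.
LINK_RE = re.compile(r'<link r:resource="([^"]+)"')

def iter_urls(dump_path):
    """Yield external page URLs from the DMOZ content RDF dump,
    scanning line by line to avoid loading the multi-GB file at once."""
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINK_RE.search(line)
            if m:
                yield m.group(1)

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def fetch_terms(url, timeout=10):
    """Download one page and return its lowercased word tokens."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    return re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())

if __name__ == "__main__":
    # Demo: extract term lists for the first few URLs in the dump.
    for i, url in enumerate(iter_urls("content.rdf.u8.gz")):
        if i >= 5:
            break
        try:
            print(url, fetch_terms(url)[:20])
        except Exception as exc:
            print(url, "failed:", exc)
```

At 4.5M pages you would want to parallelize the fetching and persist the results (e.g. one compressed file per batch), but the URL extraction and term stripping above are the core of it.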