DMOZ Extracted web pages

Discussion in 'ODP / DMOZ' started by akathi2, Jul 13, 2009.

  1. #1
    Is the extracted DMOZ dataset (4.5M web pages) available somewhere? I know I can use some software to do this, but someone must have done it before. Any idea where I can get it?

    Even a list of terms extracted from each URL would be sufficient. I'd appreciate any help.
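
For illustration, the "list of terms extracted from each URL" could be produced along the following lines once a page's HTML is in hand. This is only a minimal sketch using the Python standard library: the names `TextExtractor` and `extract_terms` are hypothetical, the tokenization (lowercased ASCII letters and digits) is an assumption rather than anything specified in the thread, and fetching the pages themselves is not covered.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_terms(html):
    """Return a Counter mapping each lowercased term to its frequency."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumed tokenization: runs of ASCII letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text))
```

Calling `extract_terms(page_html)` on each downloaded page and storing the result next to its URL would give a simple per-URL term list of the kind asked about here.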
     
    akathi2, Jul 13, 2009
  2. Qryztufre

    Qryztufre Prominent Member

    Messages:
    6,071
    Likes Received:
    491
    Best Answers:
    0
    Trophy Points:
    300
    #2
    http://rdf.dmoz.org/

    Keep in mind that your DMOZ clone will be considered duplicate content in the eyes of most search engines. Original content is king... start your own directory and work like hell. Or better, find a less saturated niche and work like hell... your rewards will be self-evident.
     
    Qryztufre, Jul 15, 2009
  3. akathi2

    akathi2 Peon

    Messages:
    3
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #3
    I know about that one. Those datasets contain only the URLs, not the actual contents of the pages.

    I want the contents of the URLs (i.e. already crawled).

    Is that available anywhere?
     
    akathi2, Jul 15, 2009
  4. Qryztufre

    #4
    That goes beyond the scope of ANY directory if I am following you correctly. What would you do with all that junk?
     
    Qryztufre, Jul 15, 2009
  5. akathi2

    #5
    The extracted pages could form a document collection on which web queries can be experimentally evaluated for different retrieval tasks. The queries are available, but the document collection (i.e. the extracted web pages) is not.
     
    akathi2, Jul 16, 2009