How to make use of wiki data?

Discussion in 'Programming' started by Old Welsh Guy, Jun 16, 2006.

  1. #1
    I want to set up a niche wiki and seed it with data from Wikipedia. I know this is OK under the terms of the licence, but I am looking for advice/help from anyone who knows how to do it.

    I understand there is a massive wiki dump with ALL the data, but I only want to use certain sections of it. Can anyone give any advice or guidance, or point me in the right direction towards scripts that will help me achieve this?

    Thanks in advance :)
     
    Old Welsh Guy, Jun 16, 2006 IP
  2. donteatchicken

    donteatchicken Well-Known Member

    #2
    donteatchicken, Jun 18, 2006 IP
  3. Old Welsh Guy

    Old Welsh Guy Notable Member

    #3
    Thanks for that, but looking at it, it works on the whole dump :( There are other scripts that do that already. I know I could scrape the data, but that is not the done thing; I want to make use of it properly.
     
    Old Welsh Guy, Jun 19, 2006 IP
  4. KalvinB

    KalvinB Peon

    #4
    You can download the Wikipedia data dumps from Wikipedia itself:

    http://en.wikipedia.org/wiki/Wikipedia:Download

    You then install MediaWiki on your own server and run the import script included with MediaWiki. The directions are on the Wikipedia site.

    The whole process of importing all the articles (there are over 1 million) took several days on my PIII 900, but once it was loaded it seemed to be working smoothly enough.
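
    If it helps, the import step looks roughly like this. This is a minimal sketch assuming MediaWiki's standard maintenance/importDump.php script, which reads the dump XML from stdin; the dump file name and install path below are just placeholders.

    import bz2
    import subprocess

    DUMP = "enwiki-pages-articles.xml.bz2"   # hypothetical dump file name
    WIKI = "/var/www/wiki"                   # hypothetical MediaWiki install path

    # Decompress the .bz2 dump and stream it into importDump.php via stdin.
    proc = subprocess.Popen(
        ["php", WIKI + "/maintenance/importDump.php"],
        stdin=subprocess.PIPE,
    )
    with bz2.open(DUMP, "rb") as dump:
        for chunk in iter(lambda: dump.read(1 << 20), b""):
            proc.stdin.write(chunk)
    proc.stdin.close()
    proc.wait()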

    Since Wikipedia is not organized by categories, it'll be very difficult to do a niche version of it. If you only want certain pages, you'll most likely have to manually grab the page source and paste it into your own wiki pages.

    You could just as well take the whole thing, set up a blog on the front page, and link to the articles you're interested in. Create categories for your blog entries to index the pages for your niche that way.

    Google loves the blog+wiki combo

    And with a million+ articles you'll do rather well in search engines if you can get Google to index more than just the articles you point to with the blog.
     
    KalvinB, Jun 20, 2006 IP
  5. toykilla

    toykilla Peon

    #5
    You can export all the pages you need

    en.wikipedia.org/wiki/Special:Export

    Then import them into your wiki. One problem is that you will have to manually grab the pictures and upload them to your own server.

    You can mass-export by going to a category page on Wikipedia, copying all the pages listed in that category, and pasting them into the export form.
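
    If you'd rather script it than use the form, something like this rough sketch posts a list of titles to Special:Export. The page titles are placeholders and the form field names are from memory, so check them against the actual export form.

    import urllib.parse
    import urllib.request

    # Hypothetical page titles copied from a category page; the export form
    # takes one title per line in its "pages" field.
    titles = ["Cardiff", "Swansea", "Newport, Wales"]

    data = urllib.parse.urlencode({
        "pages": "\n".join(titles),
        "curonly": "1",              # current revision only
    }).encode()

    req = urllib.request.Request(
        "https://en.wikipedia.org/wiki/Special:Export",
        data=data,
        headers={"User-Agent": "niche-wiki-export-sketch/0.1"},
    )
    with urllib.request.urlopen(req) as resp, open("export.xml", "wb") as out:
        out.write(resp.read())       # XML you can feed to Special:Import on your wiki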
     
    toykilla, Jun 22, 2006 IP
  6. Old Welsh Guy

    Old Welsh Guy Notable Member

    #6
    Run that by me again? I visited that page, but it says it has been disabled.

    How about this idea then: does anyone know of a wiki scraper script I can use? I will download the dump, install it on my OWN server in MediaWiki, and then I can scrape the data from MY OWN version of Wikipedia.

    I have emphasised 'my own' as I have absolutely no intention of scraping live wiki data, but I have a feeling this is the easiest way to do it. To this end I will mirror Wikipedia on a server of my own, so it is only my own bandwidth and resources that are being used.
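
    To make it concrete, what I have in mind on my own mirror is something like this rough sketch. The mirror URL and page title are just placeholders, and I'm assuming MediaWiki's index.php action=raw hands back plain wikitext rather than rendered HTML.

    import urllib.parse
    import urllib.request

    BASE = "http://localhost/wiki/index.php"   # hypothetical local mirror URL
    title = "Cardiff"                          # hypothetical page title

    # action=raw asks MediaWiki for the page's wikitext instead of HTML.
    url = BASE + "?" + urllib.parse.urlencode({"title": title, "action": "raw"})
    with urllib.request.urlopen(url) as resp:
        wikitext = resp.read().decode("utf-8")
    print(wikitext[:200])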
     
    Old Welsh Guy, Jun 22, 2006 IP
  7. DXL

    DXL Peon

    #7
    Why don't you just write a script that opens the dump file, parses the XML, and then checks whether the "text" element contains a reference to your niche's category?
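
    Something along these lines would do it. This is a rough sketch: the dump file name and category string are placeholders, and since the dump's XML namespace changes between versions it just matches on local tag names.

    import xml.etree.ElementTree as ET

    DUMP = "enwiki-pages-articles.xml"      # hypothetical (uncompressed) dump file
    CATEGORY = "[[Category:Wales]]"         # hypothetical niche category tag

    def local(tag):
        return tag.rsplit("}", 1)[-1]       # strip the {namespace} prefix

    # Stream the dump page by page instead of loading it all into memory.
    for event, elem in ET.iterparse(DUMP, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = next((e.text for e in elem.iter() if local(e.tag) == "title"), "")
        text = next((e.text or "" for e in elem.iter() if local(e.tag) == "text"), "")
        if CATEGORY in text:
            print(title)                    # or copy the whole <page> to a new file
        elem.clear()                        # drop the finished page's children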
     
    DXL, Jun 22, 2006 IP