I want to set up a niche wiki and seed it with data from Wikipedia. I know this is OK under the terms of the licence, but I'm looking for advice/help from anyone who knows how to do it. I understand there is a massive wiki dump with ALL the data, but I only want to use certain sections of it. Can anyone give me any advice, guidance, or a pointer in the right direction to scripts that will help me achieve this? Thanks in advance.
Thanks for that, but looking at it, it works on the whole dump, and there are already other scripts that do that. I know I can scrape the data, but that is not the done thing; I want to make use of it properly.
You can download the Wikipedia data dumps from Wikipedia: http://en.wikipedia.org/wiki/Wikipedia:Download

You then install MediaWiki on your own server and run the import script included with MediaWiki. The directions are on the Wikipedia site. The whole process took several days on my PIII 900 to import all the articles (there are over 1 million), but once it was loaded it seemed to work smoothly enough.

Since the dump is not organized by category, it'll be very difficult to do a niche version of it. If you only want certain pages, you'll most likely have to grab the page source manually and paste it into your own wiki pages.

You could just as well take the whole thing and then set up a blog on the front page that links to the articles you're interested in. Create categories for your blog entries to index the pages for your niche that way. Google loves the blog+wiki combo, and with a million-plus articles you'll do rather well in search engines if you can get Google to index more than just the articles you point to from the blog.
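If you want to script the import step rather than run it by hand, something along these lines should work. This is only a minimal sketch: the MediaWiki path and the dump filename are assumptions, and it just drives MediaWiki's own bundled maintenance scripts from Python.

```python
import subprocess

# Paths and filenames are assumptions -- adjust to your own install and dump.
MEDIAWIKI_DIR = "/var/www/mediawiki"
DUMP_FILE = "/tmp/enwiki-latest-pages-articles.xml.bz2"

# importDump.php ships with MediaWiki and, as far as I know, accepts a plain
# or compressed XML dump file as its argument.
subprocess.run(
    ["php", "maintenance/importDump.php", DUMP_FILE],
    cwd=MEDIAWIKI_DIR,
    check=True,
)

# importDump.php suggests rebuilding recent changes after a large import.
subprocess.run(
    ["php", "maintenance/rebuildrecentchanges.php"],
    cwd=MEDIAWIKI_DIR,
    check=True,
)
```

Expect the import of a full dump to run for a long time, as noted above.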
You can export all the pages you need via Special:Export: en.wikipedia.org/wiki/Special:Export

Then import them into your wiki. One problem is that you will have to grab the pictures manually and upload them to your own server. You can mass-export by going to a category page on Wikipedia, copying all the page titles listed in that category, and pasting them into the export form.
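If you'd rather not paste titles by hand, here's a rough sketch of scripting the export. The form field names ("pages", "curonly") are what the export form appears to use, so treat them as assumptions and check them against the live form; the titles are placeholders.

```python
import urllib.parse
import urllib.request

# Replace with the page titles you actually want (placeholders here).
titles = ["Example article one", "Example article two"]

# Special:Export accepts a POSTed, newline-separated list of page titles.
# Field names may differ between MediaWiki versions -- verify on your target wiki.
data = urllib.parse.urlencode({
    "pages": "\n".join(titles),
    "curonly": "1",   # current revision only
    "action": "submit",
}).encode("utf-8")

req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Special:Export",
    data=data,
    headers={"User-Agent": "niche-wiki-export-sketch/0.1"},
)

with urllib.request.urlopen(req) as resp, open("export.xml", "wb") as out:
    out.write(resp.read())
```

The resulting export.xml is in the same format as the dump, so it can be fed to Special:Import or importDump.php on your own wiki; images still have to be fetched separately, as noted above.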
Run that by me again? I visited that page, but it says it has been disabled.

How about this idea, then: does anyone know of a wiki scraper script I can use? I will download the dump, install it in MediaWiki on my OWN server, and then scrape the data from MY OWN copy of Wikipedia. I have emphasised "my own" because I have absolutely no intention of scraping live Wikipedia data, but I have a feeling this is the easiest way to do it. To that end I will mirror the wiki on a server of my own, so it is only my own bandwidth and resources that are being scraped.
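To make that concrete, here is roughly what I have in mind. Once the mirror is running I shouldn't even need a screen scraper, because MediaWiki ships with api.php and can hand back raw wikitext. This is only a sketch under my own assumptions: the API URL and the page title are placeholders for whatever my install ends up using.

```python
import json
import urllib.parse
import urllib.request

# Assumed location of api.php on my own mirror -- adjust to match the install.
API_URL = "http://localhost/wiki/api.php"

def fetch_wikitext(title):
    """Fetch the raw wikitext of one page from a MediaWiki api.php endpoint."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    })
    with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
        data = json.load(resp)
    # Page ids aren't known in advance, so take the first (and only) entry.
    page = next(iter(data["query"]["pages"].values()))
    # Older MediaWiki puts the text under "*"; newer releases want
    # rvslots=main and nest it under ["slots"]["main"]["*"].
    return page["revisions"][0]["*"]

# Placeholder title -- replace with a page that exists on the mirror.
print(fetch_wikitext("Example article")[:500])
```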
Why don't you just write a script that opens the file, parses the XML and then checks if the "text" element contains a reference to your niche's category?
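Something along these lines, as a rough sketch. It streams the dump with ElementTree so the whole file never has to sit in memory; the dump filename and the category string are placeholders you'd swap for your own.

```python
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"   # placeholder dump filename
CATEGORY = "[[Category:Your niche topic"        # placeholder category prefix

def local(tag):
    """Strip the XML namespace so tag names can be compared directly."""
    return tag.rsplit("}", 1)[-1]

matches = []
with bz2.open(DUMP, "rb") as fh:
    # iterparse lets us handle one <page> element at a time, then discard it.
    for event, elem in ET.iterparse(fh, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = text = None
        for child in elem.iter():
            if local(child.tag) == "title":
                title = child.text or ""
            elif local(child.tag) == "text":
                text = child.text or ""
        if text and CATEGORY in text:
            matches.append(title)
        elem.clear()   # free memory used by the processed page

print(f"{len(matches)} pages reference the category")
```

From the matching titles you could then build a list to feed into Special:Export, or extend the script to write the matching page elements out to a smaller XML file for import.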