Hi guys, I've been searching for a function to get the first few lines (everything before the Table of Contents, in fact) of any Wikipedia article automatically, preferably using PHP. I implemented a solution for a client of mine a while ago, but it has always had its shortcomings and is still far from perfect. Does anyone have experience with this issue as well? Can you point me to a better solution? I know about the Wikipedia API, but it isn't helping much either. Any help is appreciated. Thanks in advance, DXL
Not particularly hard: grab the page and run a regex. But please read this first: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
The part I want is the regular text above the Table of Contents, not including any blocks or notices. A combination of regular expressions is my current solution, but I'm hoping there is something better or a well-tested publicly available script. Thanks to both of you.
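For anyone who would rather not stack regular expressions, here is a minimal sketch that uses PHP's DOM extension instead. It is not a tested public script. It assumes the table of contents is an element with id="toc" (as in the markup quoted in this thread), that the intro text sits in <p> elements, and that notice boxes are tables. The function name introParagraphs() is made up for illustration.

```php
<?php
/**
 * Return the text of every <p> that appears before the table of contents.
 * Assumptions (not guaranteed by Wikipedia): the TOC has id="toc", and
 * notices/infoboxes are <table> elements, so paragraphs inside them are skipped.
 */
function introParagraphs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common trick to make loadHTML() read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Paragraphs with no TOC element before them, not inside the TOC, not inside a table.
    $query = '//p[not(preceding::*[@id="toc"])'
           . ' and not(ancestor::*[@id="toc"])'
           . ' and not(ancestor::table)]';

    $paragraphs = [];
    foreach ($xpath->query($query) as $p) {
        // Drop footnote markers such as [1], then collapse whitespace.
        $text = preg_replace('/\[\d+\]/', '', $p->textContent);
        $text = trim(preg_replace('/\s+/', ' ', $text));
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}
```

Calling introParagraphs($html) on a fetched page returns an array of plain-text paragraphs. The preceding:: test is evaluated once per paragraph, so it is slow on very large pages, but it keeps the sketch short.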
Depending on how many articles you intend to get extracts from, I wonder if it might be 'cleaner' to download the database and work off of the actual data itself?
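If you do go the database route, the page text in a dump is wikitext rather than HTML, and the part above the table of contents corresponds to everything before the first section heading. A rough sketch, assuming you have already pulled one page's wikitext out of the dump (the function name leadSection() is hypothetical, and the result still contains wiki markup such as templates and links that you would need to clean up separately):

```php
<?php
/**
 * Return the lead section of a page's wikitext: everything before the
 * first heading line such as "== History ==".
 */
function leadSection(string $wikitext): string
{
    // A heading is a whole line that starts and ends with two or more "=" signs.
    $parts = preg_split('/^={2,}[^\n]*={2,}[ \t\r]*$/m', $wikitext, 2);
    return trim($parts[0]);
}
```

A page with no headings at all comes back unchanged (apart from trimming), since there is nothing to split on.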
It's long and drawn out, but this should accomplish what you are looking for:

    wikipedia('http://en.wikipedia.org/wiki/Shawn_Hogan');

    function wikipedia($page) {
        $article = $page;

        // Each pattern strips or rewrites one piece of page chrome, notice box or inline markup.
        $pattern[0] = '/<a href="(.*?)">(.*?)<\\/a>/'; $replace[0] = '$2';
        $pattern[1] = '/<h3 id=\"siteSub\">From Wikipedia, the free encyclopedia<\/h3>/'; $replace[1] = '';
        $pattern[2] = '/<div id=\"contentSub\">(.*?)<\/div><div id=\"jump-to-nav\">Jump to: navigation, search<\/div>/'; $replace[2] = '';
        $pattern[3] = '/<div class=\"messagebox cleanup metadata\">(.*?)<p><br \/><\/p>/'; $replace[3] = '';
        $pattern[4] = '/<table class=\"messagebox\" (.*?)>(.*?)<\/table>/'; $replace[4] = '';
        $pattern[5] = '/<dl>(.*?)<\/dl>/'; $replace[5] = '';
        $pattern[6] = '/<h1 class=\"firstHeading"\>(.*?)<\/h1>/'; $replace[6] = '<h3>$1</h3>';
        $pattern[7] = '/<table class=\"messagebox protected\" style=\"border: 1px solid #8888aa; padding: 0px; font-size:9pt;\">(.*?)<\/table>/'; $replace[7] = '';
        $pattern[8] = '/<div class=\"infobox sisterproject\">(.*?)<\/div><\/div>/'; $replace[8] = '';
        $pattern[9] = '/<sup (.*?)>(.*?)<\/sup>/'; $replace[9] = '';
        $pattern[10] = '/<table style=\"background: transparent;\" width=\"0\">(.*?)<\/table>/'; $replace[10] = '';
        $pattern[11] = '/<table class=\"messagebox current\" style=\"font-size: normal;\">(.*?)<\/table>/'; $replace[11] = '';
        $pattern[12] = '/<table class=\"toccolours\" align=\"center\" width=\"55%\" cellpadding=\"0\" cellspacing=\"0\">(.*?)<\/table>/'; $replace[12] = '';
        $pattern[13] = '/<div class=\"editsection\"(.*?)>(.*?)<\/div>/'; $replace[13] = '';
        $pattern[14] = '/<div id=\"bodyContent\">/'; $replace[14] = '<div>';
        $pattern[15] = '/<dd>(.*?)<\/dd>/'; $replace[15] = '';
        $pattern[16] = '/<div class=\"messagebox cleanup metadata\">(.*?)<\/div>/'; $replace[16] = '';
        $pattern[17] = '/<div class=\"thumbcaption\">(.*?)<\/div><\/div>/'; $replace[17] = '';
        $pattern[18] = '/<div class=\"thumb tright\">/'; $replace[18] = '';
        $pattern[19] = '/\[(.*?)\]/'; $replace[19] = '';
        $pattern[20] = '/<table class="messagebox protected" (.*?)>(.*?)<\/table>/'; $replace[20] = '';
        $pattern[21] = '/<div style="position:absolute; z-index:100; right:20px; top:10px; height:10px; width:300px;"><\/div>/'; $replace[21] = '';
        $pattern[22] = '/<div style="position:absolute; z-index:100; right:10px; top:10px;" class="metadata" id="administrator">(.*?)<\/div><\/div>/'; $replace[22] = '';
        $pattern[23] = '/<table class="messagebox current"(.*?)>(.*?)<\/table>/'; $replace[23] = '';
        $pattern[24] = '/<table class="messagebox current" style="width: auto;">(.*?)<\/table>/'; $replace[24] = '';
        $pattern[25] = '/<div class="dablink">(.*?)<\/div>/'; $replace[25] = '';
        $pattern[26] = '/<b>/'; $replace[26] = '<strong>';
        $pattern[27] = '/<\/b>/'; $replace[27] = '</strong>';
        $pattern[28] = '/<div(.*?)>/'; $replace[28] = '';
        $pattern[29] = '/<\/div>/'; $replace[29] = '';
        $pattern[30] = '/<map(.*?)>(.*?)<\/map>/'; $replace[30] = '';
        $pattern[31] = '/<img src="(.*?)" alt="This page is semi-protected." width="18" (.*?)\/>/'; $replace[31] = '';
        $pattern[32] = '/<table style="width:100%;background:none">(.*?)<\/table>/'; $replace[32] = '';
        $pattern[33] = '/<div class="messagebox merge metadata">(.*?)<\/div>/'; $replace[33] = '';

        // Fetch the page and apply all replacements in order.
        $wikipedia = file_get_the_contents($article);
        $wikipedia = preg_replace($pattern, $replace, $wikipedia);

        // Keep what lies between the start-of-content marker and the first landmark found.
        if (preg_match("/<\!-- start content --\>(.*)<table id=\"toc\" class=\"toc\" summary=\"(.*)\">/", $wikipedia, $w)) {
            $wikipedia = $w[1];
        } elseif (preg_match("/<\!-- start content --\>(.*)<a name=\"(.*)\">/is", $wikipedia, $w)) {
            $wikipedia = $w[1];
        } elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"boilerplate metadata\" id=\"stub\">/is", $wikipedia, $w)) {
            $wikipedia = $w[1];
        } elseif (preg_match("/<\!-- start content --\>(.*)<div class=\"printfooter\">/is", $wikipedia, $w)) {
            $wikipedia = $w[1];
        }

        // Reduce the result to the first paragraph.
        preg_match("/\<p\>(.*)\<\/p\>/i", $wikipedia, $w);
        $wikipedia = $w[1];

        print $wikipedia;
    }

    // cURL-based replacement for file_get_contents().
    function file_get_the_contents($url) {
        $ch = curl_init();
        $timeout = 10; // set to zero for no timeout
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $file_contents = curl_exec($ch);
        curl_close($ch);
        return $file_contents;
    }

Prints out something like:

    Shawn D. Hogan (born September 1, 1975) is the founder and CEO of Digital Point Solutions, a San Diego-based business software provider. He became well-known when the article 'Shawn Hogan, Hero' appeared in the August 2006 edition of the magazine Wired, detailing his firm stand against an MPAA lawsuit.

Hope this is what you want. To get the whole article, just remove this part:

    preg_match("/\<p\>(.*)\<\/p\>/i", $wikipedia, $w);
    $wikipedia = $w[1];
Thanks a great deal, ErectADirectory! Works great. Is this something you wrote yourself, or did you find it somewhere?
No problem, glad to help. I didn't write the original, found it online somewhere a good bit ago. I'm sure I hacked it out at some point though :]
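Since the original question mentions the Wikipedia API: on wikis that have the TextExtracts extension enabled, the query API can return just the intro as plain text (prop=extracts with exintro and explaintext), which avoids scraping altogether. A hedged sketch follows. The two function names are made up for illustration, whether it suits your case depends on the endpoint you query, and the JSON shape handled here is the default one with pages keyed by page ID.

```php
<?php
/** Build a query URL asking for the intro of one article as plain text. */
function extractUrl(string $title, string $endpoint = 'https://en.wikipedia.org/w/api.php'): string
{
    $params = [
        'action'      => 'query',
        'prop'        => 'extracts', // provided by the TextExtracts extension
        'exintro'     => 1,          // only the text before the first section
        'explaintext' => 1,          // plain text instead of HTML
        'redirects'   => 1,          // follow redirects to the target article
        'format'      => 'json',
        'titles'      => $title,
    ];
    return $endpoint . '?' . http_build_query($params, '', '&');
}

/** Pull the extract out of the JSON response body; null if there is none. */
function extractFromResponse(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['query']['pages'])) {
        return null;
    }
    // Pages are keyed by page ID; take the first one that carries an extract.
    foreach ($data['query']['pages'] as $page) {
        if (isset($page['extract'])) {
            return $page['extract'];
        }
    }
    return null;
}
```

You could fetch extractUrl('Shawn_Hogan') with the file_get_the_contents() helper from the post above and pass the body to extractFromResponse(). It would also be polite to send a descriptive User-Agent header with the request.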