How to get an extract of a Wikipedia article with PHP?

Discussion in 'PHP' started by DXL, Jun 20, 2008.

  1. #1
    Hi guys,

    I've been searching for a function to get the first few lines (everything before the Table of Contents in fact) of any Wikipedia article automatically, preferably using PHP.
    I implemented a solution for a client of mine a while ago, but it has always had its shortcomings and is still far from perfect.
    Does anyone have experience with this issue as well? Can you point me to a link where I can find a better solution?
    I know about the Wikipedia API but it isn't helping much either.
    Any and all help is appreciated.

    Thanks in advance,
    DXL
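A note for later readers on the API route mentioned above: Wikipedia's api.php accepts `prop=extracts` with `exintro` (provided by the TextExtracts extension, which to my knowledge was not available when this thread was written) and returns the text before the first section heading. The sketch below only builds the request URL and parses a canned response, so it runs without network access. The helper names `extract_api_url` and `extract_from_response` and the sample JSON values are made up for illustration.

```php
<?php
// Hypothetical helper: builds an api.php URL asking for the plain-text intro.
function extract_api_url(string $title, string $lang = 'en'): string
{
    $params = [
        'action'      => 'query',
        'prop'        => 'extracts',
        'exintro'     => 1, // only the text before the first section heading
        'explaintext' => 1, // plain text instead of HTML
        'redirects'   => 1, // follow page redirects
        'format'      => 'json',
        'titles'      => $title,
    ];
    return "https://{$lang}.wikipedia.org/w/api.php?" . http_build_query($params);
}

// Hypothetical helper: pulls the first extract out of a JSON response body.
function extract_from_response(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['query']['pages']) || !is_array($data['query']['pages'])) {
        return null;
    }
    foreach ($data['query']['pages'] as $page) {
        if (isset($page['extract'])) {
            return $page['extract'];
        }
    }
    return null;
}

// Canned response in the shape the API returns (the values are invented).
$sample = '{"query":{"pages":{"123":{"pageid":123,"title":"Example","extract":"Example is a sample article."}}}}';

echo extract_api_url('Shawn Hogan'), "\n";
echo extract_from_response($sample), "\n";
```

To use it for real, fetch the URL with any HTTP client that sends a descriptive User-Agent and pass the response body to `extract_from_response()`.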
     
    DXL, Jun 20, 2008 IP
  2. King Goilio

    #2
    Which bit exactly do you want from the page?
     
    King Goilio, Jun 20, 2008 IP
  3. Danltn

    #3
    Danltn, Jun 20, 2008 IP
  4. DXL

    #4
    The part I want is the regular text above the Table of Contents, not including any blocks or notices.
    A combination of regular expressions is my current solution, but I'm hoping there is something better or a well-tested publicly available script.
    Thanks to both of you.
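One alternative to stacking regular expressions is to parse the HTML and select the paragraphs that sit before the table of contents. This is only a sketch under assumptions: that the TOC element carries `id="toc"` (as the markup quoted later in this thread suggests) and that the lead paragraphs are its siblings. The function name `lead_paragraphs` and the sample markup are invented for illustration.

```php
<?php
// Hypothetical helper: returns the plain text of every <p> that is a sibling
// of, and comes before, the element with id="toc".
function lead_paragraphs(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid markup; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//p[following-sibling::*[@id="toc"]]');

    $paragraphs = [];
    foreach ($nodes as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return $paragraphs;
}

// Made-up sample standing in for a fetched article body.
$sample = '<div id="bodyContent">'
    . '<div class="dablink">For other uses, see Example.</div>'
    . '<p><b>Example</b> is a sample article.</p>'
    . '<p>It has two lead paragraphs.</p>'
    . '<table id="toc" class="toc"><tr><td>Contents</td></tr></table>'
    . '<p>Body text after the table of contents.</p>'
    . '</div>';

print_r(lead_paragraphs($sample));
```

Notices such as the `dablink` block are skipped here simply because they are not `<p>` elements; articles without a TOC would need a fallback, which this sketch leaves out.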
     
    DXL, Jun 20, 2008 IP
  5. TwistMyArm

    #5
    Depending on how many articles you intend to get extracts from, I wonder if it might be 'cleaner' to download the database and work off of the actual data itself?
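If you do work from a database dump, you would be handling raw wikitext rather than HTML. In wikitext the lead is everything before the first `== Heading ==` line, so cutting there is a reasonable first step. A minimal sketch, assuming the article's wikitext is already in a string; `lead_wikitext` is a made-up name, and templates, links, and bold/italic quote marks are left in place for a later clean-up pass.

```php
<?php
// Hypothetical helper: returns the wikitext before the first section heading.
function lead_wikitext(string $wikitext): string
{
    // A section heading sits on its own line, wrapped in two or more "=" signs.
    $parts = preg_split('/^={2,}.*={2,}[ \t]*$/m', $wikitext, 2);
    return trim($parts[0]);
}

// Made-up sample standing in for one article from a dump.
$sample = "'''Example''' is a sample article.\n\n"
    . "It has two lead paragraphs.\n\n"
    . "== History ==\n"
    . "Body text.\n";

echo lead_wikitext($sample), "\n";
```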
     
    TwistMyArm, Jun 20, 2008 IP
  6. ErectADirectory

    #6
    It's long and drawn out, but this should accomplish what you are looking for:

    
    wikipedia('http://en.wikipedia.org/wiki/Shawn_Hogan') ;
    
    function wikipedia($page)	{
      
      $article = $page ;
    	$pattern[0] = '/<a href="(.*?)">(.*?)<\\/a>/';
    	$replace[0] = '$2';
    	$pattern[1] = '/<h3 id=\"siteSub\">From Wikipedia, the free encyclopedia<\/h3>/';
    	$replace[1] = '';
    	$pattern[2] = '/<div id=\"contentSub\">(.*?)<\/div><div id=\"jump-to-nav\">Jump to: navigation, search<\/div>/';
    	$replace[2] = '';
    	$pattern[3] = '/<div class=\"messagebox cleanup metadata\">(.*?)<p><br \/><\/p>/';
    	$replace[3] = '';
    	$pattern[4] = '/<table class=\"messagebox\" (.*?)>(.*?)<\/table>/';
    	$replace[4] = '';
    	$pattern[5] = '/<dl>(.*?)<\/dl>/';
    	$replace[5] = '';
    	$pattern[6] = '/<h1 class="firstHeading">(.*?)<\/h1>/';
    	$replace[6] = '<h3>$1</h3>';
    	$pattern[7] = '/<table class=\"messagebox protected\" style=\"border: 1px solid #8888aa; padding: 0px; font-size:9pt;\">(.*?)<\/table>/';
    	$replace[7] = '';
    	$pattern[8] = '/<div class=\"infobox sisterproject\">(.*?)<\/div><\/div>/';
    	$replace[8] = '';
    	$pattern[9] = '/<sup (.*?)>(.*?)<\/sup>/';
    	$replace[9] = '';
    	$pattern[10] = '/<table style=\"background: transparent;\" width=\"0\">(.*?)<\/table>/';
    	$replace[10] = '';
    	$pattern[11] = '/<table class=\"messagebox current\" style=\"font-size:	normal;\">(.*?)<\/table>/';
    	$replace[11] = '';
    	$pattern[12] = '/<table class=\"toccolours\" align=\"center\" width=\"55%\" cellpadding=\"0\" cellspacing=\"0\">(.*?)<\/table>/';
    	$replace[12] = '';
    	$pattern[13] = '/<div class=\"editsection\"(.*?)>(.*?)<\/div>/';
    	$replace[13] = '';
    	$pattern[14] = '/<div id=\"bodyContent\">/';
    	$replace[14] = '<div>';
    	$pattern[15] = '/<dd>(.*?)<\/dd>/';
    	$replace[15] = '';
    	$pattern[16] = '/<div class=\"messagebox cleanup metadata\">(.*?)<\/div>/';
    	$replace[16] = '';
    	$pattern[17] = '/<div class=\"thumbcaption\">(.*?)<\/div><\/div>/';
    	$replace[17] = '';
    	$pattern[18] = '/<div class=\"thumb tright\">/';
    	$replace[18] = '';
    	$pattern[19] = '/\[(.*?)\]/';
    	$replace[19] = '';
    	$pattern[20] = '/<table class="messagebox protected" (.*?)>(.*?)<\/table>/';
    	$replace[20] = '';
    	$pattern[21] = '/<div style="position:absolute; z-index:100; right:20px; top:10px; height:10px; width:300px;"><\/div>/';
    	$replace[21] = '';
    	$pattern[22] = '/<div style="position:absolute; z-index:100; right:10px; top:10px;" class="metadata" id="administrator">(.*?)<\/div><\/div>/';
    	$replace[22] = '';
    	$pattern[23] = '/<table class="messagebox current"(.*?)>(.*?)<\/table>/';
    	$replace[23] = '';
    	$pattern[24] = '/<table class="messagebox current" style="width: auto;">(.*?)<\/table>/';
    	$replace[24] = '';
    	$pattern[25] = '/<div class="dablink">(.*?)<\/div>/';
    	$replace[25] = '';
    	$pattern[26] = '/<b>/';
    	$replace[26] = '<strong>';
    	$pattern[27] = '/<\/b>/';
    	$replace[27] = '</strong>';
    	$pattern[28] = '/<div(.*?)>/';
    	$replace[28] = '';
    	$pattern[29] = '/<\/div>/';
    	$replace[29] = '';
    	$pattern[30] = '/<map(.*?)>(.*?)<\/map>/';
    	$replace[30] = '';
    	$pattern[31] = '/<img src="(.*?)" alt="This page is semi-protected." width="18" (.*?)\/>/';
    	$replace[31] = '';
    	$pattern[32] = '/<table style="width:100%;background:none">(.*?)<\/table>/';
    	$replace[32] = '';
    	$pattern[33] = '/<div class="messagebox merge metadata">(.*?)<\/div>/';
    	$replace[33] = '';
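    	// Fetch the page; curl_exec() returns false on failure.
    	$wikipedia = file_get_the_contents($article);
    	if ($wikipedia === false) {
    		return;
    	}
    	// Cut out the lead before cleaning up: patterns 28 and 29 strip every
    	// <div> tag, so the <div>-based end markers below could never match
    	// afterwards. The /s modifier lets "." span line breaks.
    	if (preg_match('/<!-- start content -->(.*?)<table id="toc" class="toc" summary="[^"]*">/is', $wikipedia, $w)) {
    		$wikipedia = $w[1];
    	} elseif (preg_match('/<!-- start content -->(.*?)<a name="[^"]*">/is', $wikipedia, $w)) {
    		$wikipedia = $w[1];
    	} elseif (preg_match('/<!-- start content -->(.*?)<div class="boilerplate metadata" id="stub">/is', $wikipedia, $w)) {
    		$wikipedia = $w[1];
    	} elseif (preg_match('/<!-- start content -->(.*?)<div class="printfooter">/is', $wikipedia, $w)) {
    		$wikipedia = $w[1];
    	}
    	// Strip links, notices, footnote markers and leftover tags from the lead.
    	$wikipedia = preg_replace($pattern, $replace, $wikipedia);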
    		
    preg_match("/\<p\>(.*)\<\/p\>/i", $wikipedia, $w) ;
    $wikipedia = $w[1] ;
    
    print $wikipedia;
    }
    
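    function file_get_the_contents($url) {
      $ch = curl_init();
      $timeout = 10; // seconds; set to zero for no timeout
      curl_setopt($ch, CURLOPT_URL, $url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
      // Follow redirects (for example from http:// to https://).
      curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
      // Wikimedia asks clients to identify themselves; this string is a placeholder.
      curl_setopt($ch, CURLOPT_USERAGENT, 'ExtractExample/1.0 (admin@example.com)');
      $file_contents = curl_exec($ch); // false on failure
      curl_close($ch);
      return $file_contents;
    }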
    
    PHP:
    Prints out something like:

    
    Shawn D. Hogan (born September 1, 1975) is the founder and CEO of Digital Point Solutions, a San Diego-based business software provider. He became well-known when the article 'Shawn Hogan, Hero' appeared in the August 2006 edition of the magazine Wired, detailing his firm stand against an MPAA lawsuit.
    HTML:
    Hope this is what you want. To get the whole article, just remove this part:

    
    preg_match("/\<p\>(.*)\<\/p\>/i", $wikipedia, $w) ;
    $wikipedia = $w[1] ;
    
    PHP:
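The two-line snippet above assumes the match succeeds and, because the pattern is greedy, takes everything from the first `<p>` to the last `</p>` on the matching line. A slightly more defensive version of that step, as a sketch only (`first_paragraph` is a made-up name, not part of the original script):

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <p>...</p>,
// or null when the input contains no paragraph.
function first_paragraph(string $html): ?string
{
    // "?" makes the match stop at the first </p>; /s lets it span line breaks.
    if (preg_match('/<p>(.*?)<\/p>/is', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Made-up input: two paragraphs, the first one wrapped over two lines.
$sample = "<p>First\nparagraph.</p><p>Second paragraph.</p>";

var_dump(first_paragraph($sample));
var_dump(first_paragraph('no paragraphs here'));
```

Returning null instead of reading `$w[1]` unconditionally avoids an undefined-index notice on pages where no `<p>` survives the clean-up.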
     
    ErectADirectory, Jun 20, 2008 IP
  7. DXL

    #7
    Thanks a great deal, ErectADirectory! Works great :)
    Is this something you wrote yourself or did you find it somewhere?
     
    DXL, Jun 21, 2008 IP
  8. ErectADirectory

    #8
    No problem, glad to help.

    I didn't write the original; I found it online somewhere a good while ago. I'm sure I hacked on it at some point though :]
     
    ErectADirectory, Jun 21, 2008 IP