RegEx to grab first paragraph of Wikipedia, help! $$'s

Discussion in 'PHP' started by x11joex11, Dec 8, 2007.

  1. #1
    Hey there, I'm going to supply some sample Wikipedia articles.

    http://en.wikipedia.org/wiki/Andre_Agassi
    http://en.wikipedia.org/wiki/Björn_Borg
    http://en.wikipedia.org/wiki/Rod_Laver
    http://en.wikipedia.org/wiki/Arthur_Ashe
    http://en.wikipedia.org/wiki/Charlotte_Cooper_(tennis)
    http://en.wikipedia.org/wiki/Australia
    http://en.wikipedia.org/wiki/Afghanistan
    http://en.wikipedia.org/wiki/New_York_City
    http://en.wikipedia.org/wiki/Herbert_Barrett
    http://en.wikipedia.org/wiki/Boris_Becker

    (take note that this time it's not bold in the first paragraph)

    http://en.wikipedia.org/wiki/Ismail_El_Shafei

    (example of an error)

    http://en.wikipedia.org/wiki/Heinz_GŸnthardt

    My problem is that I've been trying to write a regex for PHP's preg_match to grab the first paragraph of each of these pages.

    To save you some time, I'll include the code I have so far for testing. All you need to do is fill an array $searchArray with the article titles above for it to cycle through.

    function loadSearchArray($textFile)
    {
    	if(file_exists($textFile))
    	{
    		echo "<br>Found file $textFile <BR><BR>";
    		return (file($textFile));//returns an array which should be the loaded version of the file
    	}
    	else
    	{
    		die("Text File, $textFile, Not Found!");
    	}
    }
    
    function cleanWikiText($string)
    {
    	$string=preg_replace('/\\[[0-9]+\\]/s',' ',$string);//get rid of [1][2][#] etc..
    	$string=preg_replace('/\\((help|info).*?\\)/s',' ',$string);//get rid of (help.info)
    	return $string;
    }
    
    header('Content-Type: text/html; charset=utf-8');
    $searchArray=loadSearchArray($_GET['textfile']);//loads the searchArray with the file information
    
    foreach ($searchArray as $line_num => $line)
    {
    	$URL="http://en.wikipedia.org/wiki/".$line;
    	$URL=str_replace(" ","_",$URL);//spaces have to change to underscores or it will error~
    	$URL=utf8_encode(rtrim($URL));//any other hidden characters and extra spaces are trimmed~
    	
    	//$URL=utf8_encode("http://en.wikipedia.org/wiki/Heinz_G¸nthardt");
    	
    	$parts = parse_url($URL);
    	$URL = $parts['scheme'].'://'.$parts['host'].str_replace('%2F','/',urlencode($parts['path'])).(isset($parts['query'])?'?'.$parts['query']:'');
    
    	echo "<br><br>Scanning: " . $URL . " ->";
    
    	$ch = curl_init($URL);
    	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    	curl_setopt($ch, CURLOPT_HEADER, 1);
    	$result=curl_exec($ch);
    
    	//Error handling OR process URL; curl_exec() returns false on failure,
    	//so compare strictly instead of loosely against true
    	if($result!==false)
    	{
    		echo "URL passes Check";
    		
    		preg_match('/<h1 class="firstHeading">(.+?)<\/h1>/s',$result, $matches);
    		$entity_name = $matches[1];
    		echo "<br>Entity_Name = $entity_name <br>";
    		
    		preg_match('/<h1 class="firstHeading">.+?<\/h1>.+?<!-- start content -->.+?<p>.*?<b>(.+?)<\/p>/s',$result,$matches);
    		$first_paragraph = strip_tags($matches[1]);
    		$first_paragraph = cleanWikiText($first_paragraph);//gets rid of [1] (help.info) things~
    		
    		echo "First_Paragraph = $first_paragraph<br>";
    		
    		//echo $result;
    	}
    	else
    	{
    		echo "<b>Error Loading URL</b>";
    	}
    
    	//exit;//for now so it only does it once~
    }
    PHP:
    A working example of the script is here: http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

    My example just fills the array by loading it from a text file on my server; you can grab that text file if you want, just look at the location the URL points to.

    If someone can help me find a regex that does the job correctly, I would greatly appreciate it. I'm confused by how inconsistent Wikipedia can be about the placement of things; that is why the samples I showed you differ a lot. I think they cover every different case that can happen.

    I can afford up to $25+ (more if it's difficult) for assistance!

    AIM:x11joex11

    Best,
    - Joe~
     
    x11joex11, Dec 8, 2007 IP
  2. decepti0n

    #2
    If you have PHP5, try using XPath

    Here's a quick example, although it only gets the text content of the paragraph (so you won't get the links etc.). Also, it seems to have some issues with characters not showing up right (like ö). Tested it on Agassi and Borg and it worked.

    <?php
    
    $dom = new DomDocument();
    @$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Andre_Agassi');
    
    // Xpath
    $xpath = new DomXPath($dom);
    
    // First para
    $result = $xpath->query('//div[@id="bodyContent"]/p');
    echo $result->item(0)->nodeValue;
    
    ?>
    PHP:
    I stopped testing since Wikipedia then kept coming up with a 403 error, so consider getting the source of the page a different way :p
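    The 403 is most likely Wikipedia refusing requests that don't send a User-Agent header; setting one via CURLOPT_USERAGENT should get around it. A quick sketch (the UA string here is just a placeholder, substitute something identifying your own script):

```php
// Wikipedia tends to refuse requests that carry no User-Agent header,
// so identify the script explicitly before fetching the page.
$ch = curl_init('http://en.wikipedia.org/wiki/Andre_Agassi');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Placeholder identifier -- replace with something describing your bot.
$ok = curl_setopt($ch, CURLOPT_USERAGENT, 'WikiGrab/0.1 (test script)');
//$result = curl_exec($ch);//uncomment to actually fetch the page
```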

    Edit: Just checked again on Rod Laver and it gets the paragraph that says "For the blah blah, see blah blah". The only way I can figure to get the right one is to take the first p element following a table (which I don't know the syntax for at the moment).
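    For the record, the XPath for "first p element following a table" would look something like the query below. This is just a sketch run against a stub of the page structure, with the div id matching Wikipedia's bodyContent:

```php
// Stub of the page layout: an infobox table followed by the real first paragraph.
$html = '<div id="bodyContent">'
      . '<table class="infobox"><tr><td>stats</td></tr></table>'
      . '<p>Rod Laver is an Australian former tennis player.</p>'
      . '</div>';

$dom = new DomDocument();
@$dom->loadHTML($html);
$xpath = new DomXPath($dom);

// following-sibling:: selects the first <p> that comes after a <table>
// inside bodyContent, skipping the table itself.
$result = $xpath->query('//div[@id="bodyContent"]/table/following-sibling::p[1]');
$first = $result->item(0)->nodeValue;
```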
     
    decepti0n, Dec 8, 2007 IP
  3. x11joex11

    #3
    Ah okay, you're right, DOM is possibly an option. The only problem is there isn't always a table before the paragraph; sometimes there is no table at all, and that is what makes this difficult. I think I need help writing a regex with conditionals in it.
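    One DOM-based sketch of a way around the table/no-table problem: take every direct <p> child of bodyContent in document order and keep the first non-empty one, so it doesn't matter whether a table comes first. Shown here against two stub pages rather than live Wikipedia source:

```php
// Returns the first non-empty direct <p> child of bodyContent,
// whether or not an infobox table precedes it.
function firstParagraph($html)
{
	$dom = new DomDocument();
	@$dom->loadHTML($html);
	$xpath = new DomXPath($dom);
	foreach ($xpath->query('//div[@id="bodyContent"]/p') as $p) {
		$text = trim($p->nodeValue);
		if ($text !== '') {
			return $text;//first paragraph with actual content
		}
	}
	return null;//no paragraph found
}

// Two stub pages: one with a leading infobox table, one without.
$withTable = '<div id="bodyContent"><table><tr><td>infobox</td></tr></table>'
           . '<p>Andre Agassi is a former tennis player.</p></div>';
$noTable   = '<div id="bodyContent"><p></p><p>Afghanistan is a country.</p></div>';
```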
     
    x11joex11, Dec 8, 2007 IP
  4. x11joex11

    #4
    Well, I finally came up with a regex that I tested in RegexBuddy to grab any kind of paragraph, but unfortunately the subexpressions didn't work.

    The regEX is here,

    <!-- start content --(?(?=.\s\s\t\t\t<table class)(>.+?</table>\s+<p>(.+?)</p>)|(>.+?<p>(?(?=<i>)(.+?<p><b>(.+?)</p>)|((.+?</p>)))))

    If you know what I might have messed up, let me know. If you test this against source code from Wikipedia pages, you will see it grabs the information every time regardless of the situation; it just doesn't work in PHP for some reason (the subexpressions, anyway).

    Here is an example of the script in action; I just made a loop that goes through the subexpressions to test it.

    http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

    preg_match('/<!-- start content --(?(?=.\\s\\s\\t\\t\\t<table class)(>.+?<\/table>\\s+<p>(.+?)<\/p>)|(>.+?<p>(?(?=<i>)(.+?<p><b>(.+?)<\/p>)|((.+?<\/p>)))))/s',$result,$matches);
    
    for($i=0;$i<=10;$i++)
    {
    	if(!isset($matches[$i])) continue;//unmatched trailing groups are absent from $matches
    	$first_paragraph = strip_tags($matches[$i]);
    	$first_paragraph = cleanWikiText($first_paragraph);//gets rid of [1] (help.info) things~
    	echo "<b>$i</b> First_Paragraph = $first_paragraph<br>";
    }
    PHP:
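    For what it's worth, PCRE's lookahead conditional (?(?=...)yes|no) does work in preg_match; the catch is how the capture groups inside the two branches land in $matches. Unmatched trailing groups are simply absent from the array rather than empty strings, which makes a fixed 0..10 loop hit undefined indexes. A minimal, self-contained illustration:

```php
// (?(?=\d)yes|no): if the next character is a digit, only the "yes"
// branch is tried; otherwise only the "no" branch.
$pattern = '/^(?(?=\d)(\d+)|([a-z]+))$/';

preg_match($pattern, '12345', $m1);//digit branch: group 1 captures, group 2 is absent
preg_match($pattern, 'hello', $m2);//letter branch: group 1 is empty, group 2 captures
```

    So before reading $matches[$i], check isset($matches[$i]) first.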
     
    x11joex11, Dec 8, 2007 IP