RegEx to grab first paragraph of Wikipedia, help! $$'s

Discussion in 'PHP' started by x11joex11, Dec 8, 2007.

  1. #1
    Hey there, I'm going to supply some sample Wikipedia articles.

    http://en.wikipedia.org/wiki/Andre_Agassi
    http://en.wikipedia.org/wiki/Björn_Borg
    http://en.wikipedia.org/wiki/Rod_Laver
    http://en.wikipedia.org/wiki/Arthur_Ashe
    http://en.wikipedia.org/wiki/Charlotte_Cooper_(tennis)
    http://en.wikipedia.org/wiki/Australia
    http://en.wikipedia.org/wiki/Afghanistan
    http://en.wikipedia.org/wiki/New_York_City
    http://en.wikipedia.org/wiki/Herbert_Barrett
    http://en.wikipedia.org/wiki/Boris_Becker

    (take note that this time it's not bold in the first paragraph)

    http://en.wikipedia.org/wiki/Ismail_El_Shafei

    (example of an error)

    http://en.wikipedia.org/wiki/Heinz_GŸnthardt

    My problem is that I've been trying to write a regex for PHP's preg_match to grab the first paragraph of each of these pages.

    To save you some time, I'll include the code I have so far for testing. All you need to do is fill an array $searchArray with the article titles above for it to cycle through.

    function loadSearchArray($textFile)
    {
    	if(file_exists($textFile))
    	{
    		echo "<br>Found file $textFile <BR><BR>";
    		return (file($textFile));//returns an array which should be the loaded version of the file
    	}
    	else
    	{
    		die("Text File, $textFile, Not Found!");
    	}
    }
    
    function cleanWikiText($string)
    {
    	$string=preg_replace('/\\[[0-9]+\\]/s',' ',$string);//get rid of [1][2][#] etc..
    	$string=preg_replace('/\\((help|info).*?\\)/s',' ',$string);//get rid of (help.info)
    	return $string;
    }
    
    header('Content-Type: text/html; charset=utf-8');
    $searchArray=loadSearchArray($_GET['textfile']);//loads the searchArray with the file information
    
    foreach ($searchArray as $line_num => $line)
    {
    	$URL="http://en.wikipedia.org/wiki/".$line;
    	$URL=str_replace(" ","_",$URL);//spaces have to change to underscores or it will error~
    	$URL=utf8_encode(rtrim($URL));//any other hidden characters and extra spaces are trimmed~
    	
    	//$URL=utf8_encode("http://en.wikipedia.org/wiki/Heinz_G¸nthardt");
    	
    	$parts = parse_url($URL);
    	$URL = $parts['scheme'].'://'.$parts['host'].str_replace('%2F','/',urlencode($parts['path'])).(isset($parts['query'])?'?'.$parts['query']:'');
    
    	echo "<br><br>Scanning: " . $URL . " ->";
    
    	$ch = curl_init($URL);
    	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    	curl_setopt($ch, CURLOPT_HEADER, 1);
    	$result=curl_exec($ch);
    
    	//Error handling OR process URL; curl_exec() returns false on failure,
    	//so compare strictly instead of loosely against true
    	if($result!==false)
    	{
    		echo "URL passes Check";
    		
    		preg_match('/<h1 class="firstHeading">(.+?)<\/h1>/s',$result, $matches);
    		$entity_name = $matches[1];
    		echo "<br>Entity_Name = $entity_name <br>";
    		
    		preg_match('/<h1 class="firstHeading">.+?<\/h1>.+?<!-- start content -->.+?<p>.*?<b>(.+?)<\/p>/s',$result,$matches);
    		$first_paragraph = strip_tags($matches[1]);
    		$first_paragraph = cleanWikiText($first_paragraph);//gets rid of [1] (help.info) things~
    		
    		echo "First_Paragraph = $first_paragraph<br>";
    		
    		//echo $result;
    	}
    	else
    	{
    		echo "<b>Error Loading URL</b>";
    	}
    
    	//exit;//for now so it only does it once~
    }
    PHP:
    A working example of the script is here: http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

    My example just fills the array by loading it from a text file on my server; you can grab that text file if you want, just look at the location the URL points to.

    If someone can help me find a regex that does the job correctly, I would greatly appreciate it. I'm confused by how inconsistent Wikipedia can be about the placement of things; that is why the samples I showed you differ a lot. I think they cover every different case that can happen.

    I can afford up to $25+ (more if it's difficult) for assistance!

    AIM:x11joex11

    Best,
    - Joe~
     
    x11joex11, Dec 8, 2007 IP
  2. decepti0n

    #2
    If you have PHP5, try using XPath

    Here's a quick example, although it only gets the text content of the paragraph (so you won't get the links etc.). Also, it seems to have some issues with characters not showing up right (like ö). Tested it on Agassi and Borg and it worked.

    <?php
    
    $dom = new DomDocument();
    @$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Andre_Agassi');
    
    // Xpath
    $xpath = new DomXPath($dom);
    
    // First para
    $result = $xpath->query('//div[@id="bodyContent"]/p');
    echo $result->item(0)->nodeValue;
    
    ?>
    PHP:
    I stopped testing since Wikipedia then kept coming up with a 403 error, so consider getting the source of the page a different way :p
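    The 403 is most likely Wikipedia refusing requests that don't send a User-Agent header; setting one via CURLOPT_USERAGENT should get around it. A quick sketch (the UA string here is just a placeholder, substitute something identifying your own script):

```php
// Wikipedia tends to refuse requests that carry no User-Agent header,
// so identify the script explicitly before fetching the page.
$ch = curl_init('http://en.wikipedia.org/wiki/Andre_Agassi');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Placeholder identifier -- replace with something describing your bot.
$ok = curl_setopt($ch, CURLOPT_USERAGENT, 'WikiGrab/0.1 (test script)');
//$result = curl_exec($ch);//uncomment to actually fetch the page
```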

    Edit: Just checked again on Rod Laver and it gets the paragraph that says "For the blah blah, see blah blah". The only way I can figure to get the right one is to take the first p element following a table (which I don't know the syntax for at the moment).
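    For the record, the XPath for "first p element following a table" would look something like the query below. This is just a sketch run against a stub of the page structure, with the div id matching Wikipedia's bodyContent:

```php
// Stub of the page layout: an infobox table followed by the real first paragraph.
$html = '<div id="bodyContent">'
      . '<table class="infobox"><tr><td>stats</td></tr></table>'
      . '<p>Rod Laver is an Australian former tennis player.</p>'
      . '</div>';

$dom = new DomDocument();
@$dom->loadHTML($html);
$xpath = new DomXPath($dom);

// following-sibling:: selects the first <p> that comes after a <table>
// inside bodyContent, skipping the table itself.
$result = $xpath->query('//div[@id="bodyContent"]/table/following-sibling::p[1]');
$first = $result->item(0)->nodeValue;
```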
     
    decepti0n, Dec 8, 2007 IP
  3. x11joex11

    #3
    Ah okay, you're right, DOM is possibly an option. The only problem is there isn't always a table before the paragraph; sometimes there is no table at all, and that is what makes this difficult. I think I need help writing a regex with conditionals in it.
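    One DOM-based sketch of a way around the table/no-table problem: take every direct <p> child of bodyContent in document order and keep the first non-empty one, so it doesn't matter whether a table comes first. Shown here against two stub pages rather than live Wikipedia source:

```php
// Returns the first non-empty direct <p> child of bodyContent,
// whether or not an infobox table precedes it.
function firstParagraph($html)
{
	$dom = new DomDocument();
	@$dom->loadHTML($html);
	$xpath = new DomXPath($dom);
	foreach ($xpath->query('//div[@id="bodyContent"]/p') as $p) {
		$text = trim($p->nodeValue);
		if ($text !== '') {
			return $text;//first paragraph with actual content
		}
	}
	return null;//no paragraph found
}

// Two stub pages: one with a leading infobox table, one without.
$withTable = '<div id="bodyContent"><table><tr><td>infobox</td></tr></table>'
           . '<p>Andre Agassi is a former tennis player.</p></div>';
$noTable   = '<div id="bodyContent"><p></p><p>Afghanistan is a country.</p></div>';
```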
     
    x11joex11, Dec 8, 2007 IP
  4. x11joex11

    #4
    Well, I finally came up with a regex that I tested in RegexBuddy to grab any kind of paragraph, but unfortunately the subexpressions didn't work.

    The regEX is here,

    <!-- start content --(?(?=.\s\s\t\t\t<table class)(>.+?</table>\s+<p>(.+?)</p>)|(>.+?<p>(?(?=<i>)(.+?<p><b>(.+?)</p>)|((.+?</p>)))))

    If you know what I might have messed up, let me know. If you test this against source code from Wikipedia pages, you will see it grabs the information every time regardless of the situation; it just doesn't work in PHP for some reason (the subexpressions, anyway).

    Here is an example of the script in action; I just made a loop that goes through the subexpressions to test it.

    http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

    preg_match('/<!-- start content --(?(?=.\\s\\s\\t\\t\\t<table class)(>.+?<\/table>\\s+<p>(.+?)<\/p>)|(>.+?<p>(?(?=<i>)(.+?<p><b>(.+?)<\/p>)|((.+?<\/p>)))))/s',$result,$matches);
    
    for($i=0;$i<=10;$i++)
    {
    	if(!isset($matches[$i])) continue;//unmatched trailing groups are absent from $matches
    	$first_paragraph = strip_tags($matches[$i]);
    	$first_paragraph = cleanWikiText($first_paragraph);//gets rid of [1] (help.info) things~
    	echo "<b>$i</b> First_Paragraph = $first_paragraph<br>";
    }
    PHP:
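    For what it's worth, PCRE's lookahead conditional (?(?=...)yes|no) does work in preg_match; the catch is how the capture groups inside the two branches land in $matches. Unmatched trailing groups are simply absent from the array rather than empty strings, which makes a fixed 0..10 loop hit undefined indexes. A minimal, self-contained illustration:

```php
// (?(?=\d)yes|no): if the next character is a digit, only the "yes"
// branch is tried; otherwise only the "no" branch.
$pattern = '/^(?(?=\d)(\d+)|([a-z]+))$/';

preg_match($pattern, '12345', $m1);//digit branch: group 1 captures, group 2 is absent
preg_match($pattern, 'hello', $m2);//letter branch: group 1 is empty, group 2 captures
```

    So before reading $matches[$i], check isset($matches[$i]) first.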
     
    x11joex11, Dec 8, 2007 IP