Hey there, I'm going to supply some sample Wikipedia articles:

http://en.wikipedia.org/wiki/Andre_Agassi
http://en.wikipedia.org/wiki/Björn_Borg
http://en.wikipedia.org/wiki/Rod_Laver
http://en.wikipedia.org/wiki/Arthur_Ashe
http://en.wikipedia.org/wiki/Charlotte_Cooper_(tennis)
http://en.wikipedia.org/wiki/Australia
http://en.wikipedia.org/wiki/Afghanistan
http://en.wikipedia.org/wiki/New_York_City
http://en.wikipedia.org/wiki/Herbert_Barrett
http://en.wikipedia.org/wiki/Boris_Becker (take note that this time the name is not bold in the first paragraph)
http://en.wikipedia.org/wiki/Ismail_El_Shafei (example of an error)
http://en.wikipedia.org/wiki/Heinz_Günthardt

My problem is that I've been trying to write a regex for PHP's preg_match to grab the first paragraph of each of these pages. To save you some time, I'll send you the code I have so far for doing the testing. All you need to do is fill an array $searchArray with a bunch of results for it to cycle through, to make testing work for the items above.

function loadSearchArray($textFile)
{
    if (file_exists($textFile)) {
        echo "<br>Found file $textFile <br><br>";
        return file($textFile); // returns an array: one element per line of the file
    } else {
        die("Text File, $textFile, Not Found!");
    }
}

function cleanWikiText($string)
{
    $string = preg_replace('/\[[0-9]+\]/s', ' ', $string);          // get rid of [1][2][#] etc.
    $string = preg_replace('/\((help|info).*?\)/s', ' ', $string);  // get rid of (help·info)
    return $string;
}

header('Content-Type: text/html; charset=utf-8');
$searchArray = loadSearchArray($_GET['textfile']); // loads the searchArray with the file information

foreach ($searchArray as $line_num => $line) {
    $URL = "http://en.wikipedia.org/wiki/" . $line;
    $URL = str_replace(" ", "_", $URL); // spaces have to change to underscores or it will error~
    $URL = utf8_encode(rtrim($URL));    // any other hidden characters and extra space are trimmed~
    //$URL = utf8_encode("http://en.wikipedia.org/wiki/Heinz_Günthardt");
    $parts = parse_url($URL);
    $URL = $parts['scheme'] . '://' . $parts['host']
         . str_replace('%2F', '/', urlencode($parts['path']))
         . (isset($parts['query']) ? '?' . $parts['query'] : '');

    echo "<br><br>Scanning: " . $URL . " ->";

    $ch = curl_init($URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    $result = curl_exec($ch);

    // Error handling OR process URL
    if ($result == true) {
        echo "URL passes check";
        preg_match('/<h1 class="firstHeading">(.+?)<\/h1>/s', $result, $matches);
        $entity_name = $matches[1];
        echo "<br>Entity_Name = $entity_name <br>";
        preg_match('/<h1 class="firstHeading">.+?<\/h1>.+?<!-- start content -->.+?<p>.*?<b>(.+?)<\/p>/s', $result, $matches);
        $first_paragraph = strip_tags($matches[1]);
        $first_paragraph = cleanWikiText($first_paragraph); // gets rid of [1] and (help·info) things~
        echo "First_Paragraph = $first_paragraph<br>";
        //echo $result;
    } else {
        echo "<b>Error Loading URL</b>";
    }
    //exit; // for now so it only does it once~
}

An example of the script I made working is here: http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

My example just fills the array by loading it from a text file on my server; you can grab that text file if you want, just look at the location it's pointing to.
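For what it's worth, the URL-building step (spaces to underscores, then percent-encoding the path) can be exercised on its own against one of the trickier titles. This is a minimal sketch; the function name buildWikiUrl is my own, not from the script above:

```php
<?php
// Sketch of the URL normalization used in the script: spaces become
// underscores, then the path is percent-encoded with urlencode(), which
// also encodes '/' as %2F, so the slashes are restored afterwards.
function buildWikiUrl($title)
{
    $url = 'http://en.wikipedia.org/wiki/' . str_replace(' ', '_', rtrim($title));
    $parts = parse_url($url);
    return $parts['scheme'] . '://' . $parts['host']
         . str_replace('%2F', '/', urlencode($parts['path']))
         . (isset($parts['query']) ? '?' . $parts['query'] : '');
}

// Parentheses in the title survive as %28 / %29 in the final URL.
echo buildWikiUrl('Charlotte Cooper (tennis)');
```

The isset() check on $parts['query'] avoids the undefined-index notice you would otherwise get for URLs with no query string.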
If someone can help me find a regex that does the job correctly, I would greatly appreciate it. I'm so confused because of how random the wiki can be about the placement of things (that is why the samples I showed you differ a lot; I think that covers every different possible thing that can happen). I can afford up to $25+ (more if it's difficult) for assistance! AIM: x11joex11 Best, - Joe~
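The cleanup step is easy to test in isolation, independent of the fetching and matching. Here is a small sketch of the cleanWikiText() function from the post above run against a made-up sample string (not real Wikipedia output):

```php
<?php
// The two cleanup passes from the script: footnote markers like [1][2]
// and the "(help·info)" pronunciation link are each replaced by a space.
function cleanWikiText($string)
{
    $string = preg_replace('/\[[0-9]+\]/s', ' ', $string);          // strip [1][2][#] etc.
    $string = preg_replace('/\((help|info).*?\)/s', ' ', $string);  // strip (help·info)
    return $string;
}

// Made-up sample input for illustration.
echo cleanWikiText('Andre Agassi (help·info) is a retired tennis player.[1][2]');
```

Note the second pattern is lazy (.*?), so it stops at the first closing parenthesis; a greedy .* would eat everything up to the last ')' in the paragraph.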
If you have PHP5, try using XPath. Here's a quick example, although it only gets the text content of the paragraph (so you won't get the links etc.). Also it seems to have some issues with characters not showing up right (like ö). Tested it on Agassi and Borg and it worked:

<?php
$dom = new DomDocument();
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Andre_Agassi');

// XPath
$xpath = new DomXPath($dom);

// First para
$result = $xpath->query('//div[@id="bodyContent"]/p');
echo $result->item(0)->nodeValue;
?>

I stopped testing since the wiki then kept coming up with a 403 error, so consider getting the source of the page a different way.

Edit: Just checked again on Rod Laver and it'll get the paragraph that says "For the blah blah, see blah blah". The only way I figure to get the right one is to get the first p element following a table (which I don't know the syntax for atm)
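The "first p element following a table" idea can be written with XPath's following-sibling axis. Here is a sketch against a made-up stand-in for a Wikipedia page (inline HTML rather than a live fetch, since the site was returning 403s), with a fallback for pages that have no table at all:

```php
<?php
// Made-up miniature of a Wikipedia article: an infobox table followed by
// the lead paragraph, inside the bodyContent div.
$html = '<div id="bodyContent">'
      . '<table class="infobox"><tr><td>stats</td></tr></table>'
      . '<p>Rod Laver is an Australian former tennis player.</p>'
      . '</div>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// First <p> that comes after the first <table> in the content div...
$nodes = $xpath->query('//div[@id="bodyContent"]/table[1]/following-sibling::p[1]');
// ...falling back to the plain first <p> when there is no table.
if ($nodes->length === 0) {
    $nodes = $xpath->query('//div[@id="bodyContent"]/p[1]');
}
echo $nodes->item(0)->nodeValue;
```

This still doesn't handle the "For other uses, see..." hatnote paragraphs on table-less pages, but it covers the infobox case the regex was fighting with.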
Ah okay, you are right, DOM is possibly an option. The only problem is there isn't always a table before it. Sometimes there is no table; that is what makes this difficult. I think I need help writing a regex with conditionals in it.
Well, I finally came up with a regex that I tested in RegexBuddy to grab any kind of paragraph, but unfortunately the sub-expressions didn't work. The regex is here:

<!-- start content --(?(?=.\s\s\t\t\t<table class)(>.+?</table>\s+<p>(.+?)</p>)|(>.+?<p>(?(?=<i>)(.+?<p><b>(.+?)</p>)|((.+?</p>)))))

If you know what I might have messed up, let me know. If you test this against source code from Wikipedia pages you will see it grabs the information every time regardless of the situation; it just doesn't work in PHP for some reason (the sub-expressions, anyway). Here is an example of the script in action; I just made a loop that goes through the sub-expressions to test it. http://dnfinder.net/rentacoder/wikigrab.php?textfile=sample_articles_revised.txt

preg_match('/<!-- start content --(?(?=.\\s\\s\\t\\t\\t<table class)(>.+?<\/table>\\s+<p>(.+?)<\/p>)|(>.+?<p>(?(?=<i>)(.+?<p><b>(.+?)<\/p>)|((.+?<\/p>)))))/s', $result, $matches);

for ($i = 0; $i <= 10; $i++) {
    $first_paragraph = strip_tags($matches[$i]);
    $first_paragraph = cleanWikiText($first_paragraph); // gets rid of [1] (help·info) things~
    echo "<b>$i</b> First_Paragraph = $first_paragraph<br>";
}
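One thing that may explain the "sub-expressions don't work" symptom: with a PCRE conditional subpattern, the groups of both branches are still numbered, and the groups belonging to the branch that did not match come back as empty strings in $matches. So the paragraph lands at a different index depending on which branch fired. A stripped-down sketch (the pattern and input are made up for illustration, not the real Wikipedia regex):

```php
<?php
// A tiny conditional subpattern, (?(?=lookahead)yes|no). Against input
// that fails the lookahead, the "no" branch matches, and the "yes"
// branch's capture groups (1 and 2 here) come back as empty strings.
$pattern = '/^(?(?=<table)(<table>(.+?)<\/table>)|(<p>(.+?)<\/p>))/';

preg_match($pattern, '<p>hello</p>', $matches);

// $matches[1] and $matches[2] are '' (the <table> branch didn't fire);
// $matches[3] is '<p>hello</p>' and $matches[4] is 'hello'.
print_r($matches);
```

So when looping over $matches, checking each sub-expression for a non-empty string (rather than assuming a fixed index) should pick out whichever branch actually captured the paragraph.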