Hey guys, I am learning PHP by working on a scraper. I have already made a few scrapers, but I can't seem to get this one working.

    function search( $keyword, $page = 1 ) {
        $keyword = urlencode( $keyword );
        $page = urlencode( $page );
        /*
        <li style="padding-bottom:15px;"><a style="font-weight:bold;" href="/alcohol+treatment+program+the+best+answer+for+alcoholics-157800" title="Alcohol Treatment Program: The Best Answer For Alcoholics">Alcohol Treatment Program: The Best Answer For Alcoholics</a><br>
        <span style="color:#aaaaaa">Date: 09.10.2009 | Author: <a style="color:#aaaaaa" href="/author-kvnsmith456.html">kvnsmith456</a> | <a href="/fitness.html">Fitness</a></span>
        <div>in the present world, alcohol addiction is one of the important factors for the creating nuisance in the social and economic lives. the alcoholic people are not only creating distress for themselves b...</div></li>
        */
        preg_match_all('/<li style="padding-bottom:15px;">[ ]*<a style="font-weight:bold;" href="([^"\n]*) title="([^"\n]*)"/s', $this->request( "{$this->searchUrl}{$keyword}-$page}" ), $matches );
        echo "{$this->searchUrl}{$keyword}-{$page}";
        $return = array();
        foreach( $matches[2] as $key => $value ) {
            $return[] = array( 'title' => $matches[1][ $key ], 'url' => $value );
        }
        return $return;
    }

As you can see, I am trying to parse the title and URL (href) out of the HTML shown in the comment, but it just won't work. I'd appreciate it if someone could fix this and/or explain a few things about preg_match_all to me.

PS: You can ask me for high-PR backlinks in return.
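For anyone reading along, here is a rough sketch of why the posted pattern appears to find nothing. It assumes the page looks like the sample in the comment. The pattern has no closing quote after the href group, so it can never reach title=. Two smaller things stand out as well: the request URL ends in a stray } ("-$page}" interpolates $page and then prints a literal brace), and the loop stores capture group 1 (the href) under 'title'. The variable names $html, $broken and $fixed are made up for this illustration.

```php
<?php
// Sample list item, shortened from the HTML in the original post.
$html = '<li style="padding-bottom:15px;"><a style="font-weight:bold;" '
      . 'href="/alcohol+treatment+program+the+best+answer+for+alcoholics-157800" '
      . 'title="Alcohol Treatment Program: The Best Answer For Alcoholics">'
      . 'Alcohol Treatment Program: The Best Answer For Alcoholics</a></li>';

// Pattern as posted: there is no closing quote after the href group.
$broken = '/<li style="padding-bottom:15px;">[ ]*<a style="font-weight:bold;" href="([^"\n]*) title="([^"\n]*)"/s';

// Same pattern with the missing quote added.
$fixed = '/<li style="padding-bottom:15px;">[ ]*<a style="font-weight:bold;" href="([^"\n]*)" title="([^"\n]*)"/s';

$brokenCount = preg_match_all($broken, $html, $m1);
$fixedCount  = preg_match_all($fixed, $html, $m2);

// Group 1 is the href and group 2 is the title attribute.
$results = array();
foreach ($m2[1] as $key => $url) {
    $results[] = array('url' => $url, 'title' => $m2[2][$key]);
}

echo "broken: $brokenCount, fixed: $fixedCount\n";
print_r($results);
```

This only explains the empty result. It still depends on the exact attribute order and spacing in the page, so a looser pattern may hold up better.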
    $pattern = '%<a[^>]*? href="([^"]+)"[^>]*>([^<]+)</a>%si';
    preg_match_all($pattern, $subject, $matches);
    $return = array();
    foreach ($matches[1] as $k => $v) {
        $return[] = array(
            'url' => $matches[1][$k],
            'title' => $matches[2][$k],
        );
    }
Yup, that should cover it. That regex matches any <a></a> tag, so long as it contains an href="..." attribute.
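Since the question also asked how preg_match_all works, here is a small sketch of how it fills $matches for that generic link pattern. The HTML in $subject is a shortened, made-up version of one search result. In the default order, $matches[0] holds the full matches and $matches[N] holds capture group N. The PREG_SET_ORDER flag groups everything per match instead.

```php
<?php
// Shortened, made-up sample of one search result with three links in it.
$subject = '<li style="padding-bottom:15px;">'
         . '<a style="font-weight:bold;" href="/some-article-157800" title="Some Article">Some Article</a><br>'
         . '<span>Author: <a style="color:#aaaaaa" href="/author-kvnsmith456.html">kvnsmith456</a> | '
         . '<a href="/fitness.html">Fitness</a></span></li>';

$pattern = '%<a[^>]*? href="([^"]+)"[^>]*>([^<]+)</a>%si';

// Default (PREG_PATTERN_ORDER): one array per capture group.
$count = preg_match_all($pattern, $subject, $matches);
$urls   = $matches[1]; // every href, in document order
$titles = $matches[2]; // every link text, in document order

// PREG_SET_ORDER: one array per match, as array(full match, href, text).
preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER);

echo "links found: $count\n";
print_r($urls);
print_r($sets[0]);
```

The pattern picks up the author and category links too, not just the article link, so it returns more rows than there are search results.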
Okay, I tried running the script, but it still doesn't seem to work: the page just keeps loading, even though I am using a keyword with only 24 results. (We are parsing a search results page here.) http://www.articlepool.com/tag-ninja Would anyone like to fix the script for me for some money?
Yes, and that's why I was trying to parse only the links inside <li style="padding-bottom:15px;"> </li> (please see the source code at http://www.articlepool.com/tag-ninja). I could have hired a coder to fix these issues, but I am trying to learn PHP myself. I hope you guys don't mind helping me here.
OK, change the pattern to

    $pattern = '%<li style="padding-bottom:15px;"><a[^>]*? href="([^"]+)"[^>]*>(.*?)</a>%si';

and change this line

    'title' => $matches[2][$k],

to

    'title' => strip_tags($matches[2][$k]),
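Put together, the two changes above might look like the sketch below. The function name parse_results and the sample HTML are made up for illustration. The real page may differ, so treat this as a starting point and not a tested scraper.

```php
<?php
// Hypothetical helper: returns one array('url' => ..., 'title' => ...) per result.
function parse_results($html) {
    // Only links that directly follow the result <li> are matched.
    $pattern = '%<li style="padding-bottom:15px;"><a[^>]*? href="([^"]+)"[^>]*>(.*?)</a>%si';
    preg_match_all($pattern, $html, $matches);

    $return = array();
    foreach ($matches[1] as $k => $url) {
        $return[] = array(
            'url'   => $url,
            // (.*?) can capture inline markup, so strip it from the title.
            'title' => strip_tags($matches[2][$k]),
        );
    }
    return $return;
}

// Made-up sample: two results, the first with an author link and bold text.
$html = '<li style="padding-bottom:15px;"><a style="font-weight:bold;" href="/first-article-1" title="First">First <b>Article</b></a><br>'
      . '<span><a href="/author-a.html">a</a></span></li>'
      . '<li style="padding-bottom:15px;"><a style="font-weight:bold;" href="/second-article-2" title="Second">Second Article</a></li>';

$results = parse_results($html);
print_r($results);
```

The author link is skipped because it does not sit directly after the opening <li> tag.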
Thanks a lot, Jay. It works now; I was able to parse all the titles and URLs. If it's not too much to ask, can you also tell me how I can parse the article body? I have it working for another, similar website but can't get it to work for this one (the one I did has a simpler structure, though). The body starts after <div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;"> and ends with <b>Rate this article:</b>

    $temp = $this->request( "{$this->baseUrl}{$article['url']}" );
    $temp = explode( '<div style="float: left; padding-bottom: 5px; padding-top: 5px; padding-right: 10px; clear: both;">', $temp );
    $temp = explode( '<b>Rate this article:</b>', $temp[1] );
    $article['body'] = $temp[0];
    $return[] = $article;

Thanks again.
Here's the regex for it:

    '%<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si'
You'd be best off using preg_match for it, and using strip_tags again on the content to remove any unwanted HTML tags.
    $temp = $this->request( "{$this->baseUrl}{$article['url']}" );
    $pattern = '%<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si';
    preg_match($pattern, $temp, $matches);
    $return = array(strip_tags($matches[1]));
    return $return;

Change your code to that. I am guessing there is only one article?
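One thing to watch with the preg_match version: if the pattern does not match, $matches[1] is not set. Below is a small sketch that checks the return value first. The helper name extract_body and the sample page are made up for illustration.

```php
<?php
// Hypothetical helper: returns the article text, or null when the markers are missing.
function extract_body($html) {
    $pattern = '%<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }
    return trim(strip_tags($matches[1]));
}

// Made-up sample page.
$page = '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">'
      . '<p>Ninjas were <i>covert</i> agents.</p></div><b>Rate this article:</b>';

$body    = extract_body($page);
$missing = extract_body('<p>no article here</p>');

var_dump($body, $missing);
```

Returning null lets the caller skip a page whose layout has changed instead of storing an empty body.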
I used this instead, and it parses ALL articles from the search results page.

    $temp = $this->request( "{$this->baseUrl}{$article['url']}" );
    $temp = explode( '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">', $temp );
    $temp = explode( '<b>Rate this article:</b>', $temp[1] );

Repped you, brother.
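A note on the explode approach: the delimiter has to match the page byte for byte. The earlier attempt had spaces after the colons ("float: left;"), while this one uses "float:left;". The sketch below shows the difference and guards against a missing delimiter. The helper name body_between and the sample page are made up for illustration.

```php
<?php
// Hypothetical helper: text between $start and $end, or null if $start is absent.
function body_between($html, $start, $end) {
    $parts = explode($start, $html, 2);
    if (count($parts) < 2) {
        return null; // start marker not found, so there is no $parts[1]
    }
    // If $end is absent, this returns everything after $start.
    $parts = explode($end, $parts[1], 2);
    return $parts[0];
}

// Made-up sample page.
$page = '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">'
      . 'Article text here.<b>Rate this article:</b> stars';

$exact  = '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">';
$spaced = '<div style="float: left; padding-bottom: 5px; padding-top: 5px; padding-right: 10px; clear: both;">';

$found    = body_between($page, $exact, '<b>Rate this article:</b>');
$notFound = body_between($page, $spaced, '<b>Rate this article:</b>');

var_dump($found, $notFound);
```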