help with preg_match_all

g_bot Well-Known Member

Messages:: 248

Likes Received:: 1

Best Answers:: 0

Trophy Points:: 150

#1

Hey guys, I am learning PHP by working on a scraper.

I have already made a few scrapers, but I can't seem to get this one working.

  

function search( $keyword, $page = 1 ) {
            
            $keyword = urlencode( $keyword );
            $page = urlencode( $page );
            
            /*<li style="padding-bottom:15px;"><a  style="font-weight:bold;" href="/alcohol+treatment+program+the+best+answer+for+alcoholics-157800" title="Alcohol Treatment Program: The Best Answer For Alcoholics">Alcohol Treatment Program: The Best Answer For Alcoholics</a><br><span style="color:#aaaaaa">Date: 09.10.2009 | Author: <a style="color:#aaaaaa" href="/author-kvnsmith456.html">kvnsmith456</a> | <a href="/fitness.html">Fitness</a></span><div>in the present world, alcohol addiction is one of the important factors for the creating nuisance in the social and economic lives. the alcoholic people are not only creating distress for themselves b...</div></li>*/
            
            preg_match_all('/<li style="padding-bottom:15px;">[ ]*<a  style="font-weight:bold;" href="([^"\n]*) title="([^"\n]*)"/s', $this->request( "{$this->searchUrl}{$keyword}-$page}" ), $matches );
            
         echo "{$this->searchUrl}{$keyword}-{$page}";
            
            $return = array();
            
            foreach( $matches[2] as $key => $value ) {
                
                $return[] = array(
                    'title' => $matches[1][ $key ],
                    'url' => $value
                );
                
            }
            
            return $return;
            
        }

PHP:

As you can see, I am trying to parse the title and URL (href) out of the source HTML shown in the comment above, but it just won't work.

I'd appreciate it if someone could fix this and/or explain a few things about preg_match_all to me.

PS: You can ask me for high-PR backlinks in return.

g_bot, Oct 9, 2009 IP
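
A note on why the pattern in post #1 matches nothing: the `href="([^"\n]*)` group is never followed by a closing quote, so the regex expects ` title=` directly after the URL, while the page has `" title=` there. Separately, the request URL is built with `-$page}` rather than `-{$page}`, which leaves a literal `}` at the end of the string, and the loop stores group 1 (the href) under 'title' and group 2 under 'url'. Here is a minimal sketch of the first and third points, using a shortened copy of the sample <li> from the post as a stand-in for the real response ($html and its contents are assumptions for illustration):

```php
<?php
// Shortened copy of the sample markup from post #1 (assumed stand-in for the real page).
$html = '<li style="padding-bottom:15px;"><a  style="font-weight:bold;" '
      . 'href="/alcohol+treatment+program+the+best+answer+for+alcoholics-157800" '
      . 'title="Alcohol Treatment Program: The Best Answer For Alcoholics">'
      . 'Alcohol Treatment Program: The Best Answer For Alcoholics</a></li>';

// Pattern as posted: no closing quote after the href capture group.
$broken = '/<li style="padding-bottom:15px;">[ ]*<a  style="font-weight:bold;" href="([^"\n]*) title="([^"\n]*)"/s';

// Same pattern with the missing quote added after the href group.
$fixed  = '/<li style="padding-bottom:15px;">[ ]*<a  style="font-weight:bold;" href="([^"\n]*)" title="([^"\n]*)"/s';

$brokenCount = preg_match_all($broken, $html, $brokenMatches);
$fixedCount  = preg_match_all($fixed, $html, $fixedMatches);

// Group 1 is the href and group 2 is the title, so map them accordingly.
$results = array();
foreach ($fixedMatches[1] as $key => $url) {
    $results[] = array('url' => $url, 'title' => $fixedMatches[2][$key]);
}

var_dump($brokenCount, $fixedCount);
print_r($results);
```

A pattern this literal still depends on the exact attribute order and spacing of the page, which is the motivation for the looser patterns suggested in the replies.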

JAY6390 Peon

Messages:: 918

Likes Received:: 31

Best Answers:: 0

Trophy Points:: 0

#2


$pattern = '%<a[^>]*? href="([^"]+)"[^>]*>([^<]+)</a>%si';
preg_match_all($pattern, $subject, $matches);

$return = array();
foreach($matches[1] as $k=>$v)
{
	$return[] = array(
		'url' => $matches[1][$k],
		'title' => $matches[2][$k],
	);
}

PHP:

JAY6390, Oct 9, 2009 IP

g_bot Well-Known Member

#3

Wow, that was quick, thanks.

I am going to test it out and post an update in a minute.

g_bot, Oct 9, 2009 IP

JAY6390 Peon

#4

hehe gotta do something to cure my boredom

JAY6390, Oct 9, 2009 IP

w47w47 Peon

Messages:: 255

Likes Received:: 0

Best Answers:: 0

Trophy Points:: 0

#5

Post an example of what the URL looks like on the remote site, so that we know which string to match. :>

w47w47, Oct 9, 2009 IP

w47w47 Peon

#6

Ah... I didn't read the whole post, lol, sorry. I think JAY6390 answered your question.

w47w47, Oct 9, 2009 IP

JAY6390 Peon

#7

Yup, that should cover it. That regex matches any <a></a> tag so long as it contains an href="..." attribute and its link text has no nested tags.

JAY6390, Oct 9, 2009 IP
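
To illustrate what "any <a></a> tag" means in practice, here is a small sketch that runs the pattern from post #2 against a made-up fragment containing one navigation link and one search result. The $subject string and its URLs are hypothetical, not taken from the real page:

```php
<?php
// Hypothetical page fragment: one navigation link and one search result.
$subject = '<a href="/fitness.html">Fitness</a>'
         . '<li style="padding-bottom:15px;"><a  style="font-weight:bold;" '
         . 'href="/some-article-1" title="Some Article">Some Article</a></li>';

// Pattern from post #2: any <a> with an href attribute and plain-text link content.
$pattern = '%<a[^>]*? href="([^"]+)"[^>]*>([^<]+)</a>%si';
$count = preg_match_all($pattern, $subject, $matches);

// $matches[1] holds the hrefs, $matches[2] the link texts, index-aligned.
$links = array();
foreach ($matches[1] as $k => $url) {
    $links[] = array('url' => $url, 'title' => $matches[2][$k]);
}

print_r($links);
```

Both links come back, so on a full search page this pattern would also collect menu, author and category links alongside the results.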

g_bot Well-Known Member

#8

Okay, I tried running the script with

$pattern = '%<a[^>]*? href="([^"]+)"[^>]*>([^<]+)</a>%si';
preg_match_all($pattern, $this->request( "{$this->searchUrl}{$keyword}-$page}" ), $matches);

It still doesn't seem to work: the page just keeps loading, even though I am using a keyword with only 24 results. (We are parsing a search result page here: http://www.articlepool.com/tag-ninja)

Would anyone like to fix the script for me for some money?

Last edited: Oct 9, 2009

g_bot, Oct 9, 2009 IP

JAY6390 Peon

#9

So you just want the search results, not every link on the page?

JAY6390, Oct 9, 2009 IP

g_bot Well-Known Member

#10

Yes, and that's why I was trying to parse only the links inside <li style="padding-bottom:15px;"> </li> (please see the source code at http://www.articlepool.com/tag-ninja).

I could have hired a coder to fix these issues, but I am trying to learn PHP myself. I hope you guys don't mind helping me here.

g_bot, Oct 9, 2009 IP

JAY6390 Peon

#11

OK, change the pattern to

$pattern = '%<li style="padding-bottom:15px;"><a[^>]*? href="([^"]+)"[^>]*>(.*?)</a>%si';

PHP:

and change this line

'title' => $matches[2][$k],

PHP:

to

'title' => strip_tags($matches[2][$k]),

PHP:

JAY6390, Oct 9, 2009 IP
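
Put together, the two changes from post #11 could look like the sketch below. The $subject fragment is hypothetical (the <b> inside the link text is added only to show what strip_tags does); the pattern is the one given in the post:

```php
<?php
// Hypothetical fragment: a navigation link plus one search result whose
// link text contains a nested tag.
$subject = '<a href="/fitness.html">Fitness</a>'
         . '<li style="padding-bottom:15px;"><a  style="font-weight:bold;" '
         . 'href="/some-article-1" title="Some Article"><b>Some</b> Article</a></li>';

// Pattern from post #11: only links that directly follow the result <li>.
$pattern = '%<li style="padding-bottom:15px;"><a[^>]*? href="([^"]+)"[^>]*>(.*?)</a>%si';
preg_match_all($pattern, $subject, $matches);

$return = array();
foreach ($matches[1] as $k => $url) {
    $return[] = array(
        'url'   => $url,
        // (.*?) may capture nested markup, so reduce it to plain text.
        'title' => strip_tags($matches[2][$k]),
    );
}

print_r($return);
```

The navigation link is skipped because it is not preceded by the result <li>, and the lazy (.*?) stops at the first closing </a>.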

g_bot Well-Known Member

#12

Thanks a lot, Jay. It works now; I was able to parse all titles and URLs.

If it's not too much to ask, can you also tell me how to parse the article body? I have it working for another, similar website but can't get it to work for this one (the one I did has a simpler structure, though).

The body starts after <div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;"> and ends with <b>Rate this article:</b>
$temp = $this->request( "{$this->baseUrl}{$article['url']}" );
                    
                    $temp = explode( '<div style="float: left; padding-bottom: 5px; padding-top: 5px; padding-right: 10px; clear: both;">', $temp );
                    $temp = explode( '<b>Rate this article:</b>', $temp[1] );
                    
                    $article['body'] = $temp[0];
                    
                    $return[] = $article;
PHP:
Thanks again

g_bot, Oct 9, 2009 IP

JAY6390 Peon

#13

Here's the regex for it:
'%<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si'

JAY6390, Oct 9, 2009 IP

JAY6390 Peon

#14

You'd be best off using preg_match for it, and using strip_tags again on the content to remove any unwanted HTML tags.

JAY6390, Oct 9, 2009 IP
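
A minimal sketch of that suggestion, with a check on the return value so a page without the markers does not end up reading an undefined $matches[1]. The extractBody() helper and the sample $page string are hypothetical; the pattern is the one from post #13:

```php
<?php
// Hypothetical helper: returns the article text, or null when the markers are not found.
function extractBody($html)
{
    $pattern = '%<div style="float:left; padding-bottom:5px; padding-top:5px; '
             . 'padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si';

    // preg_match returns 1 on a match, 0 on no match and false on error.
    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }

    // Drop the markup inside the body and trim surrounding whitespace.
    return trim(strip_tags($matches[1]));
}

// Made-up page content for illustration.
$page = '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">'
      . '<p>First paragraph.</p><p>Second paragraph.</p>'
      . '<b>Rate this article:</b>';

var_dump(extractBody($page));
var_dump(extractBody('<p>unrelated page</p>'));
```

Note that strip_tags removes the tags without inserting anything in their place, so adjacent paragraphs run together unless line breaks are added before stripping.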

g_bot Well-Known Member

#15

Update: I got it; the script is working smoothly now.

Thanks, Jay, you are da man!

g_bot, Oct 9, 2009 IP

JAY6390 Peon

#16

$temp = $this->request( "{$this->baseUrl}{$article['url']}" );
$pattern = '%<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si';
preg_match($pattern, $temp, $matches);
$return = array(strip_tags($matches[1]));

return $return;

PHP:

Change your code to that.
I am guessing there is only one article?

JAY6390, Oct 9, 2009 IP

g_bot Well-Known Member

#17

I used this instead, and it parses ALL articles from the search result page.

$temp = $this->request( "{$this->baseUrl}{$article['url']}" );
					
					$temp = explode( '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">', $temp );
					$temp = explode( '<b>Rate this article:</b>', $temp[1] );

PHP:

Repped you brother

g_bot, Oct 9, 2009 IP
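
One thing worth noticing about the explode() approach: the delimiter has to match the page byte for byte. The version in post #12 has a space after every colon in the style attribute ("float: left; ..."), while the markup quoted in the same post, and the working version in post #17, has none, which may be why the first attempt failed. Below is a sketch with a guard for missing markers; the between() helper and the sample $page are hypothetical:

```php
<?php
// Hypothetical helper: text between two literal markers, or null if either is missing.
function between($html, $start, $end)
{
    $parts = explode($start, $html, 2);
    if (count($parts) < 2) {
        return null; // start marker not found
    }

    $parts = explode($end, $parts[1], 2);
    if (count($parts) < 2) {
        return null; // end marker not found
    }

    return $parts[0];
}

// Made-up page content using the un-spaced style attribute.
$page = '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">'
      . 'Article text<b>Rate this article:</b>';

$withSpaces    = '<div style="float: left; padding-bottom: 5px; padding-top: 5px; padding-right: 10px; clear: both;">';
$withoutSpaces = '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">';
$end           = '<b>Rate this article:</b>';

var_dump(between($page, $withSpaces, $end));    // delimiter differs from the page
var_dump(between($page, $withoutSpaces, $end)); // delimiter matches the page exactly
```

Without the count() checks, a missing marker would make $temp[1] undefined, which is easy to overlook when the scraper loops over many articles.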

JAY6390 Peon

#18

Excellent

JAY6390, Oct 9, 2009 IP
