help with preg_match_all

Discussion in 'PHP' started by g_bot, Oct 9, 2009.

  1. #1
    Hey guys, I am learning PHP by working on a scraper.

    I have already made a few scrapers, but I can't seem to get this one working.

      
    
    function search( $keyword, $page = 1 ) {

        $keyword = urlencode( $keyword );
        $page = urlencode( $page );

        /*<li style="padding-bottom:15px;"><a  style="font-weight:bold;" href="/alcohol+treatment+program+the+best+answer+for+alcoholics-157800" title="Alcohol Treatment Program: The Best Answer For Alcoholics">Alcohol Treatment Program: The Best Answer For Alcoholics</a><br><span style="color:#aaaaaa">Date: 09.10.2009 | Author: <a style="color:#aaaaaa" href="/author-kvnsmith456.html">kvnsmith456</a> | <a href="/fitness.html">Fitness</a></span><div>in the present world, alcohol addiction is one of the important factors for the creating nuisance in the social and economic lives. the alcoholic people are not only creating distress for themselves b...</div></li>*/

        preg_match_all('/<li style="padding-bottom:15px;">[ ]*<a  style="font-weight:bold;" href="([^"\n]*) title="([^"\n]*)"/s', $this->request( "{$this->searchUrl}{$keyword}-$page}" ), $matches );

        echo "{$this->searchUrl}{$keyword}-{$page}";

        $return = array();

        foreach( $matches[2] as $key => $value ) {

            $return[] = array(
                'title' => $matches[1][ $key ],
                'url' => $value
            );

        }

        return $return;

    }
    
    
    
    PHP:
    As you can see, I am trying to parse the title and URL (href) out of the sample source code (the commented-out HTML above), but it just won't work.

    I'd appreciate it if someone could fix this and/or explain a few things about preg_match_all to me.

    PS: You can ask me for high-PR backlinks in return.
     
    g_bot, Oct 9, 2009 IP
  2. JAY6390

    JAY6390 Peon

    Messages:
    918
    Likes Received:
    31
    Best Answers:
    0
    Trophy Points:
    0
    #2
    
    $pattern = '%<a[^>]*? href="([^"]+)"[^>]*>([^<]+)</a>%si';
    preg_match_all($pattern, $subject, $matches);
    
    $return = array();
    foreach($matches[1] as $k=>$v)
    {
    	$return[] = array(
    		'url' => $matches[1][$k],
    		'title' => $matches[2][$k],
    	);
    }
    
    PHP:
     
    JAY6390, Oct 9, 2009 IP
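
A note on why the pattern in post #1 finds nothing, as far as can be read from the posted code: the href group is never closed with a `"`, so the regex requires ` title="` directly after the URL characters, while the markup has `" title="` at that point. The request URL also carries a stray `}` after `$page`, and the loop stores group 1 (the href) under 'title' and group 2 (the title) under 'url'. The sketch below checks the regex part against a shortened copy of the sample markup from post #1. The variable names are illustrative and not part of the original script.

```php
<?php
// Shortened copy of the sample markup quoted in post #1.
$html = '<li style="padding-bottom:15px;"><a  style="font-weight:bold;" '
      . 'href="/alcohol+treatment+program+the+best+answer+for+alcoholics-157800" '
      . 'title="Alcohol Treatment Program: The Best Answer For Alcoholics">'
      . 'Alcohol Treatment Program: The Best Answer For Alcoholics</a></li>';

// Pattern as posted: the closing quote after the href group is missing.
$broken = '/<li style="padding-bottom:15px;">[ ]*<a  style="font-weight:bold;" href="([^"\n]*) title="([^"\n]*)"/s';

// Same pattern with the closing quote restored.
$fixed = '/<li style="padding-bottom:15px;">[ ]*<a  style="font-weight:bold;" href="([^"\n]*)" title="([^"\n]*)"/s';

$brokenCount = preg_match_all($broken, $html, $brokenMatches);
$fixedCount  = preg_match_all($fixed, $html, $fixedMatches);

$results = array();
foreach ($fixedMatches[1] as $key => $url) {
    $results[] = array(
        'url'   => $url,                    // group 1 is the href
        'title' => $fixedMatches[2][$key],  // group 2 is the title attribute
    );
}

var_dump($brokenCount, $fixedCount);
print_r($results);
```

The restored pattern still depends on the exact attribute order and spacing of the page, which is one reason the looser pattern in post #2 is easier to maintain.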
  3. g_bot

    g_bot Well-Known Member

    Messages:
    248
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    150
    #3
    Wow, that was quick, thanks.

    I am going to test it out and post an update in a minute.
     
    g_bot, Oct 9, 2009 IP
  4. JAY6390

    #4
    hehe gotta do something to cure my boredom :D
     
    JAY6390, Oct 9, 2009 IP
  5. w47w47

    w47w47 Peon

    Messages:
    255
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #5
    Post an example of what the URL looks like on the remote site, so that we know which string to match. :>
     
    w47w47, Oct 9, 2009 IP
  6. w47w47

    #6
    Ah... I didn't read the whole post, lol, sorry. I think JAY6390 answered your question. ;)
     
    w47w47, Oct 9, 2009 IP
  7. JAY6390

    #7
    Yup, that should cover it. That regex matches any <a>...</a> tag that contains a double-quoted href="..." attribute, as long as the link text has no nested tags in it.
     
    JAY6390, Oct 9, 2009 IP
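
To make the behaviour of the generic pattern concrete, here is a small sketch run against markup modelled on the sample in post #1. The article URL and the extra bold link are made up for illustration. It suggests two things: the pattern picks up the author and category links as well as the article link, and it skips a link whose text starts with a nested tag, because `[^<]+` cannot match across `<b>`.

```php
<?php
// Markup modelled on post #1; "/some-article-1" and "/bold.html" are hypothetical.
$subject = '<li style="padding-bottom:15px;">'
         . '<a  style="font-weight:bold;" href="/some-article-1" title="Some Article">Some Article</a><br>'
         . '<span>Author: <a style="color:#aaaaaa" href="/author-kvnsmith456.html">kvnsmith456</a>'
         . ' | <a href="/fitness.html">Fitness</a></span></li>'
         . '<a href="/bold.html"><b>Bold link</b></a>';

// Pattern from post #2.
$pattern = '%<a[^>]*? href="([^"]+)"[^>]*>([^<]+)</a>%si';
preg_match_all($pattern, $subject, $matches);

// $matches[1] holds every href, $matches[2] every link text, index-aligned.
$links = array();
foreach ($matches[1] as $k => $url) {
    $links[] = array('url' => $url, 'title' => $matches[2][$k]);
}

print_r($links);
```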
  8. g_bot

    #8
    Okay, I tried running the script with your code, but it still doesn't seem to work: the page just keeps loading, even though I am using a keyword with only 24 results. (We are parsing a search results page here: http://www.articlepool.com/tag-ninja)

    Would anyone like to fix the script for me for some money? :D
     
    Last edited: Oct 9, 2009
    g_bot, Oct 9, 2009 IP
  9. JAY6390

    #9
    So you are just wanting the search results, not every link on the page?
     
    JAY6390, Oct 9, 2009 IP
  10. g_bot

    #10
    Yes, and that's why I was trying to parse only the links inside <li style="padding-bottom:15px;"> </li> (please see the source code at http://www.articlepool.com/tag-ninja).

    I could have hired a coder to fix these issues, but I am trying to learn PHP myself. I hope you guys don't mind helping me here :)
     
    g_bot, Oct 9, 2009 IP
  11. JAY6390

    #11
    ok change the pattern to
    $pattern = '%<li style="padding-bottom:15px;"><a[^>]*? href="([^"]+)"[^>]*>(.*?)</a>%si';
    PHP:
    and change this line
    'title' => $matches[2][$k],
    PHP:
    to
    'title' => strip_tags($matches[2][$k]),
    PHP:
     
    JAY6390, Oct 9, 2009 IP
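
Putting the two changes from post #11 together, a minimal sketch might look like the following. The sample markup and its URLs are invented for illustration and only imitate the structure quoted in post #1. Note that the pattern expects `<a` to follow the `<li ...>` tag directly, with no whitespace in between.

```php
<?php
// Hypothetical search-result markup: two result items, the first with a
// nested <b> in its link text and an author link that should be ignored.
$subject = '<li style="padding-bottom:15px;"><a  style="font-weight:bold;" href="/first-article-1" title="First">First <b>Ninja</b> Article</a><br>'
         . '<span>Author: <a href="/author-someone.html">someone</a></span></li>'
         . '<li style="padding-bottom:15px;"><a  style="font-weight:bold;" href="/second-article-2" title="Second">Second Article</a></li>';

// Pattern from post #11: only links that open a result <li> are matched.
$pattern = '%<li style="padding-bottom:15px;"><a[^>]*? href="([^"]+)"[^>]*>(.*?)</a>%si';
preg_match_all($pattern, $subject, $matches);

$return = array();
foreach ($matches[1] as $k => $v) {
    $return[] = array(
        'url'   => $matches[1][$k],
        // (.*?) may capture inline tags, so strip them from the title.
        'title' => strip_tags($matches[2][$k]),
    );
}

print_r($return);
```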
  12. g_bot

    #12
    Thanks a lot Jay, it works now. I was able to parse all the titles and URLs.

    If it's not too much to ask, can you also tell me how I can parse the article body? I have it working for another, similar website, but I can't get it to work for this one (the one I did has a simpler structure, though).

    The body starts after "<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">" and ends with <b>Rate this article:</b>

    $temp = $this->request( "{$this->baseUrl}{$article['url']}" );

    $temp = explode( '<div style="float: left; padding-bottom: 5px; padding-top: 5px; padding-right: 10px; clear: both;">', $temp );
    $temp = explode( '<b>Rate this article:</b>', $temp[1] );

    $article['body'] = $temp[0];

    $return[] = $article;
    PHP:
    Thanks again
     
    g_bot, Oct 9, 2009 IP
  13. JAY6390

    #13
    Here's the regex for it
    '%<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si'
     
    JAY6390, Oct 9, 2009 IP
  14. JAY6390

    #14
    You'd be best off using preg_match for it, since there is only one body to capture per page, and using strip_tags again on the content to remove any unwanted HTML tags.
     
    JAY6390, Oct 9, 2009 IP
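
Since the thread switches between preg_match and preg_match_all, a short sketch of how the two fill `$matches` may help. The `<p>` markup is a made-up stand-in for a real page. preg_match stops at the first hit, while preg_match_all collects every hit and, by default (PREG_PATTERN_ORDER), groups the results by capture group rather than by match.

```php
<?php
// Hypothetical input with two matching elements.
$html = '<p>one</p><p>two</p>';
$pattern = '%<p>(.*?)</p>%si';

// preg_match: first match only. $first[0] is the full match, $first[1] is group 1.
$found = preg_match($pattern, $html, $first);

// preg_match_all, default order: $all[0] holds all full matches,
// $all[1] holds all group-1 captures.
$count = preg_match_all($pattern, $html, $all);

// PREG_SET_ORDER: one sub-array per match instead of one per group.
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);

print_r($first);
print_r($all);
print_r($sets);
```

The default layout is why the loops in this thread iterate over `$matches[1]` and read `$matches[2][$k]` with the same index.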
  15. g_bot

    #15
    Update: I got it, the script is working smoothly now.

    Thanks Jay, you are da man!
     
    g_bot, Oct 9, 2009 IP
  16. JAY6390

    #16
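    // Fetch the article page, then capture everything between the opening
    // div and the "Rate this article" label.
    $temp = $this->request( "{$this->baseUrl}{$article['url']}" );
    $pattern = '%<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">(.*?)<b>Rate this article:</b>%si';
    $return = array();
    // preg_match returns 1 only when the pattern was found.
    if (preg_match($pattern, $temp, $matches)) {
        $return[] = strip_tags($matches[1]);
    }

    return $return;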
    PHP:
    Change your code to that
    I am guessing there is only one article?
     
    JAY6390, Oct 9, 2009 IP
  17. g_bot

    #17
    I used this instead, and it parses ALL the articles from the search results page.

    $temp = $this->request( "{$this->baseUrl}{$article['url']}" );

    $temp = explode( '<div style="float:left; padding-bottom:5px; padding-top:5px; padding-right:10px; clear:both;">', $temp );
    $temp = explode( '<b>Rate this article:</b>', $temp[1] );
    PHP:
    Repped you brother :)
     
    g_bot, Oct 9, 2009 IP
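
One difference between the explode() calls in posts #12 and #17 is the delimiter: the earlier one has spaces after each colon ("float: left;"), the later one does not ("float:left;"). explode() needs the delimiter to match the page byte for byte, so the spaced version would leave `$temp[1]` undefined if the page uses the compact form. The sketch below wraps the same explode() approach in a guard. `extractBetween` is a hypothetical helper, not part of the original scraper, and the div style is shortened for readability.

```php
<?php
/**
 * Return the text between $start and $end, or null when either marker
 * is missing. Hypothetical helper, not part of the original class.
 */
function extractBetween($html, $start, $end)
{
    $parts = explode($start, $html, 2);
    if (count($parts) < 2) {
        return null; // start marker not found
    }
    $parts = explode($end, $parts[1], 2);
    if (count($parts) < 2) {
        return null; // end marker not found
    }
    return $parts[0];
}

// Shortened stand-in for an article page.
$page = '<div style="float:left; clear:both;">Article body</div><b>Rate this article:</b>';

// Exact delimiter: the body is found.
$ok = extractBetween($page, '<div style="float:left; clear:both;">', '<b>Rate this article:</b>');

// Delimiter with extra spaces: no match, so null instead of a notice.
$miss = extractBetween($page, '<div style="float: left; clear: both;">', '<b>Rate this article:</b>');

var_dump($ok, $miss);
```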
  18. JAY6390

    #18
    Excellent :)
     
    JAY6390, Oct 9, 2009 IP