[WTB] PHP Scraper Script

Discussion in 'PHP' started by GoldenGrahams, Aug 5, 2011.

  1. #1
    I need a PHP script that will go to http://www.gumtree.com/business-services, gather the URLs of all the adverts, and load each one through a scraper that extracts phone numbers from the page. I only want UK mobile numbers, which start with 07 and are 11 digits long.

    Eg:
    07593 354084
    0754 492 2461
    07529392193
    07523 45 56 32
    07564-435-239
    07639-232-432

    I need it to pick up all of the mobile numbers, however they are formatted. Please give me a quote.
     
    GoldenGrahams, Aug 5, 2011
  2. bogi

    bogi Well-Known Member

    #2
    Do you want to store them in a database, download them as a CSV or text file, or just display them on screen?
    If the job is still available, you can contact me with your budget. It's really easy to do, so it can be done within hours.
     
    bogi, Aug 5, 2011
  3. echipvina

    echipvina Active Member

    #3
    You can try this:
    
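    // Match 07 followed by nine more digits, each optionally preceded by one space or hyphen.
    preg_match_all('/(?<!\d)07(?:[ -]?\d){9}(?!\d)/', file_get_contents($url), $matchesarray);
    var_dump($matchesarray[0]);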
    
    PHP:
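
    Building on the one-liner above, here is a rough sketch that collects every match in a block of text and strips the separators, so the formats listed in the first post all reduce to the same 11-digit form. The function name extractUkMobiles and the sample text are made up for illustration, and the arrow function assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not from the thread): find UK mobile numbers in free text.
function extractUkMobiles(string $text): array
{
    // 07, then nine more digits, each optionally preceded by a single space or hyphen.
    // The lookarounds stop a match from starting or ending inside a longer digit run.
    preg_match_all('/(?<!\d)07(?:[ -]?\d){9}(?!\d)/', $text, $matches);

    // Strip separators so "07593 354084" and "07593-354084" normalise to the same value.
    $numbers = array_map(
        fn(string $raw): string => preg_replace('/\D/', '', $raw),
        $matches[0]
    );

    // Drop duplicates and reindex from zero.
    return array_values(array_unique($numbers));
}

// Made-up sample text mixing mobile formats with a landline.
$sample = "Call 07593 354084 or 0754 492 2461, office 020 7946 0018, "
        . "also 07564-435-239 and 07593354084.";
print_r(extractUkMobiles($sample));
```

    Normalising to bare digits before de-duplicating means the same number written two different ways is only reported once. Numbers written with other separators (dots, brackets, +44) are not covered by this pattern.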
     
    echipvina, Aug 6, 2011
  4. elixiusx

    elixiusx Peon

    #4
    
    
    <?php
    set_time_limit(0);
    
    function getInfo($url)
    {
    	$useragent = "Mozilla/5.0";
    
    	$ch = curl_init($url);
    	curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    	curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
    	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    
    	$result = curl_exec($ch);
    	curl_close($ch);
    
    	return $result;
    }
    
    function isPhone($phone)
    {
    	// Strip spaces, hyphens and underscores so every accepted format reduces to bare digits.
    	$phone = str_replace( array(' ', '-', '_'), '', $phone );
    	return strlen( $phone ) == 11 && substr($phone, 0, 2) == '07' && ctype_digit( $phone );
    }
    
    function getList()
    {
    	$info = getInfo('http://www.gumtree.com/business-services');
    	preg_match_all('|href="http://www.gumtree.com/p/business-services/(.*)" name|', $info, $final);
    	$final = $final[1];
    	$matrix = array();
    	
    	foreach( $final as $item )
    	{
    		$item = explode('"', $item);
    		$matrix[] = 'http://www.gumtree.com/p/business-services/'.$item[0];
    	}
    	
    	return $matrix;
    }
    
    function getPhone($url)
    {
    	$info = getInfo($url);
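    	preg_match('|<meta name="og:phone_number" content="(.*)"/>|', $info, $final);
    	$phone = isset($final[1]) ? $final[1] : '';
    	
    	// ereg() was removed in PHP 7; strpos() does the same substring check.
    	if( strpos($phone, 'on ') !== false )
    	{
    		$phone = explode('on ', $phone);
    		$phone = $phone[1];
    	}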
    	
    	if( empty($phone) )
    	{
    		preg_match('|<meta name="description" content="(.*)" />|msU', $info, $final);
    		$final = isset($final[1]) ? $final[1] : '';
    		$pregs = array(
    			'|07[0-9]{9}|',
    			'|07[0-9]{3} [0-9]{6}|',
    			'|07[0-9]{3}-[0-9]{6}|',
    			'|07[0-9]{2} [0-9]{3} [0-9]{4}|',
    			'|07[0-9]{2}-[0-9]{3}-[0-9]{4}|',
    			'|07[0-9]{3} [0-9]{2} [0-9]{2} [0-9]{2}|',
    			'|07[0-9]{3}-[0-9]{2}-[0-9]{2}-[0-9]{2}|',
    			'|07[0-9]{3} [0-9]{3} [0-9]{3}|',
    			'|07[0-9]{3}-[0-9]{3}-[0-9]{3}|'
    		);	
    		
    		foreach( $pregs as $preg )
    		{
    			if( preg_match( $preg, $final, $finalx ) && isPhone( $finalx[0] ) )
    			{
    				$phone = $finalx[0];
    				break;
    			}
    		}
    	}
    	
    	return $phone;
    }
    
    $list = getList();
    $x = 0; // counts adverts processed so far
    foreach( $list as $item )
    {
        $phone = getPhone($item);
    	
    	if( !empty($phone) )
    		print $phone."<br />\n";
    	
    	if( $x==30 ) break; # Remove this line to get all numbers
    	
    	$x++;
    }
    ?>
    
    
    PHP:
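
    As an alternative to the regex in getList(), the advert links could probably be collected with PHP's DOM parser, which does not depend on the exact attribute order in the markup. This is only a sketch: extractAdvertLinks is a hypothetical helper, the sample HTML is invented, and it assumes the adverts are still plain links under the /p/business-services/ path used in the script above.

```php
<?php
// Hypothetical helper (not from the thread): collect advert links with a DOM parser.
function extractAdvertLinks(string $html, string $prefix): array
{
    // loadHTML() rejects an empty string on PHP 8, so bail out early.
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; the @ suppresses parser warnings.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only links that start with the advert path.
        if (strpos($href, $prefix) === 0) {
            $links[$href] = true; // array keys de-duplicate repeated links
        }
    }

    return array_keys($links);
}

// Invented markup: two links to the same advert and one unrelated link.
$html = '<a href="http://www.gumtree.com/p/business-services/ad-one/1" name="x">One</a>'
      . '<a href="http://www.gumtree.com/help">Help</a>'
      . '<a href="http://www.gumtree.com/p/business-services/ad-one/1">One again</a>';
print_r(extractAdvertLinks($html, 'http://www.gumtree.com/p/business-services/'));
```

    The returned list could be passed straight to getPhone() in place of the result of getList().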
     
    elixiusx, Aug 10, 2011