Scraping needed data from a website and storing it in a database using cURL and PHP

Discussion in 'Google' started by Nagadurga, Jul 21, 2010.

  1. #1
    Hi...
    I am a beginner in PHP. I want to scrape only particular information: for example, if I search for hotels, it should give a list of hotel names, addresses, phone numbers, email addresses, and fax numbers.
    I have already tried the approach from www.bradino.com/php/php-screen-scraping/, but it only retrieves data that is in table format. I want to extract data that is laid out with CSS, from www.local.ch.
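For data that is marked up with CSS classes rather than tables, one possible approach is to fetch the page with cURL and query it by class name using DOMDocument and DOMXPath. The following is only a minimal sketch, not a tested scraper for www.local.ch: the function names fetch_html and texts_by_class are hypothetical, and the class names in the sample markup are assumptions about how such a page might be built.

```php
<?php
// Fetch a page with cURL; returns the HTML string, or null on failure.
// (Hypothetical helper; requires the cURL extension.)
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
    $html = curl_exec($ch);
    return ($html === false) ? null : $html;
}

// Return the trimmed text of every element whose class attribute
// contains $class as a whole word. (Hypothetical helper.)
function texts_by_class($html, $class)
{
    $doc = new DOMDocument();
    // The XML prefix hints that the input is UTF-8; @ hides warnings about sloppy markup.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $xpath = new DOMXPath($doc);
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $out = array();
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Sample markup (an assumption, used here instead of a live request).
$sample = '<div><span class="head">Bar Amalfi</span>'
        . '<span class="street-address">Turmstrasse 7</span></div>';
print_r(texts_by_class($sample, 'head'));
print_r(texts_by_class($sample, 'street-address'));
```

On a live page the call would look like texts_by_class(fetch_html($url), 'some-class'), where the real class names have to be read from the page source first.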
     
    Nagadurga, Jul 21, 2010 IP
  2. Nagadurga

    Nagadurga Peon

    Messages:
    7
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #2
    Hi friends,
    I tried the following code to extract the needed information from the web pages:
    <?php
    class ContentExtractor {
     
    	var $container_tags = array(
    			'span','p','div'
    		);
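    	// Note: entries are compared with $node->nodeName, so only bare tag names
    	// (e.g. 'meta', 'script') can match; entries with attributes such as
    	// 'div id="hd"' or 'a href' never match any node.
    	var $removed_tags = array(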
    			 'div id="hd"','meta','link','title','script','a href','img','ul','li','form','input','label','strong','href',
    			 'noscript','iframe','h2','head','span class="iconKey"'
    		);
    	var $ignore_len_tags = array(
    			'span'
    		);	
     
    	var $link_text_ratio = 0.04;
    	var $min_text_len = 20;
    	var $min_words = 0;	
     
    	var $total_links = 0;
    	var $total_unlinked_words = 0;
    	var $total_unlinked_text='';
    	var $text_blocks = 0;
     
    	var $tree = null;
    	var $unremoved=array();
     
    	function sanitize_text($text){
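    		$text = str_ireplace('&nbsp;', ' ', $text);
    		// Decode entities as UTF-8 so accented characters survive
    		$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
     
    		// Multi-byte UTF-8 space characters. A bare "\xA0" is not listed because
    		// replacing that single byte would corrupt characters such as "à" ("\xC3\xA0").
    		$utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83", 
    			"\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF");
    		$text = str_replace($utf_spaces, ' ', $text);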
     
    		return trim($text);
    	}
     
    	function extract($text, $ratio = null, $min_len = null){
    		$this->tree = new DOMDocument();
     
    		$start = microtime(true);
    		if (!@$this->tree->loadHTML($text)) return false;
     
    		$root = $this->tree->documentElement;
    		$start = microtime(true);
    		$this->HeuristicRemove($root, ( ($ratio == null) || ($min_len == null) ));
     
    		if ($ratio == null) {
    			$this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);
     
    			$words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
    			$words = array_filter($words);
    			$this->total_unlinked_words = count($words);
    			unset($words);
    			if ($this->total_unlinked_words>0) {
    				$this->link_text_ratio = $this->total_links / $this->total_unlinked_words;// + 0.01;
    				$this->link_text_ratio *= 1.3;
    			}
     
    		} else {
    			$this->link_text_ratio = $ratio;
    		};
     
    		if ($min_len == null) {
    			// Avoid division by zero when no text blocks were found
    			$this->min_text_len = ($this->text_blocks > 0)
    				? strlen($this->total_unlinked_text) / $this->text_blocks
    				: 0;
    		} else {
    			$this->min_text_len = $min_len;
    		}
     
    		$start = microtime(true);
    		$this->ContainerRemove($root);
     
    		return $this->tree->saveHTML();
    	}
     
    	function HeuristicRemove($node, $do_stats = false){
    		if (in_array($node->nodeName, $this->removed_tags)){
    			return true;
    		};
     
    		if ($do_stats) {
    			if ($node->nodeName == 'a') {
    				$this->total_links++;
    			}
    			$found_text = false;
    		};
     
    		$nodes_to_remove = array();
     
    		if ($node->hasChildNodes()){
    			foreach($node->childNodes as $child){
    				if ($this->HeuristicRemove($child, $do_stats)) {
    					$nodes_to_remove[] = $child;
    				} else if ( $do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text') ) {
    					$this->total_unlinked_text .= $child->wholeText;
    					if (!$found_text){
    						$this->text_blocks++;
    						$found_text=true;
    					}
    				};
    			}
    			foreach ($nodes_to_remove as $child){
    				$node->removeChild($child);
    			}
    		}
     
    		return false;
    	}
     
    	function ContainerRemove($node){
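    		// Return the same array shape as the normal path so callers can index it
    		if (is_null($node)) return array(0, 0, 0, 'delete' => false);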
    		$link_cnt = 0;
    		$word_cnt = 0;
    		$text_len = 0;
    		$delete = false;
    		$my_text = '';
     
    		$ratio = 1;
     
    		$nodes_to_remove = array();
    		if ($node->hasChildNodes()){
    			foreach($node->childNodes as $child){
    				$data = $this->ContainerRemove($child);
     
    				if ($data['delete']) {
    					$nodes_to_remove[]=$child;
    				} else {
    					$text_len += $data[2];
    				}
     
    				$link_cnt += $data[0];
     
    				if ($child->nodeName == 'a') {
    					$link_cnt++;
    				} else {
    					if ($child->nodeName == '#text') $my_text .= $child->wholeText;
    					$word_cnt += $data[1];
    				}
    			}
     
    			foreach ($nodes_to_remove as $child){
    				$node->removeChild($child);
    			}
     
    			$my_text = $this->sanitize_text($my_text);
     
    			$words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
    			$words = array_filter($words);
     
    			$word_cnt += count($words);
    			$text_len += strlen($my_text);
     
    		};
     
    		if (in_array($node->nodeName, $this->container_tags)){
    			if ($word_cnt>0) $ratio = $link_cnt/$word_cnt;
     
    			if ($ratio > $this->link_text_ratio){
    					$delete = true;
    			}
     
    			if ( !in_array($node->nodeName, $this->ignore_len_tags) ) {
    				if ( ($text_len < $this->min_text_len) || ($word_cnt<$this->min_words) ) {
    					$delete = true;
    				}
    			}
     
    		}	
     
    		return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
    	}
     
    }
     
    
    $html = file_get_contents('http://www.local.ch/en/q/bar.html');
     
    $extractor = new ContentExtractor();
    $content = $extractor->extract($html); 
    echo $content;
    ?>
    
    
    But I got the following output:
    Results for in the current map areain
    The number of results indicates how many listings correspond to your search.
    To view these listings, click the Search button. Do you have any questions or
    suggestions? Or maybe even come across a problem? Please let us know:
    info@local.ch
    Results for
    You can choose if you want to print the map on this page,
    by using the options (such as "No Map") which appear directly
    above the map to display or hide it.

    The Yellow Pages > Bar, Restaurant
    Bleu Lézard

    rue Enning 10, 1003 Lausanne
    resultentry_06.63771746.520077Bleu Lézard
    Bleu Lézard

    rue Enning 10, 1003 Lausanne

    Tel.: * 021 321 38 30
    tel/search

    The Yellow Pages > Bar, Restaurant, Events
    Nordportal

    Schmiedestrasse 12, 5400 Baden
    resultentry_18.30031447.481186Nordportal
    Nordportal

    Schmiedestrasse 12, 5400 Baden

    Tel.: * 056 221 15 72
    tel/search
    ADN Bar Café

    rue de Lausanne 59, 1202 Genève
    resultentry_26.14646746.215079ADN Bar Café
    ADN Bar Café

    rue de Lausanne 59, 1202 Genève

    Tel.: * 022 731 40 18
    tel/search
    Bar Abdelmajid

    Könizstrasse 3, 3008 Bern
    resultentry_37.42168146.944324Bar Abdelmajid
    Bar Abdelmajid

    Könizstrasse 3, 3008 Bern

    Tel.: 031 381 42 60
    tel/search

    The Yellow Pages > Hotel, Bar, Restaurant
    Hotel SEDARTIS

    Bahnhofstrasse 16, 8800 Thalwil
    resultentry_48.56592247.295528Hotel SEDARTIS
    Hotel SEDARTIS

    Bahnhofstrasse 16, 8800 Thalwil

    Tel.: 043 388 33 00
    tel/search

    The Yellow Pages > Club, Discotheque, Bar
    Liquid

    Genfergasse 10, 3011 Bern
    resultentry_57.44115946.949633Liquid
    Liquid

    Genfergasse 10, 3011 Bern

    Tel.: * 031 951 98 26
    tel/search
    Bar Amalfi

    Spezialitäten aus dem Süden

    Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH
    resultentry_68.78198147.368167Bar Amalfi
    Bar Amalfi

    Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH

    Tel.: * 043 535 90 05
    tel/search
    Bar Benjamin (-Gera)

    Im Allmendli 11, 8703 Erlenbach ZH
    resultentry_78.59953547.301096Bar Benjamin (-Gera)
    Bar Benjamin (-Gera)

    Im Allmendli 11, 8703 Erlenbach ZH

    Tel.: * 076 232 23 21
    tel/search

    The Yellow Pages > Restaurant, Bar
    Bohemia

    Klosbachstrasse 2, 8032 Zürich
    resultentry_98.55496947.364845Bohemia
    Bohemia

    Klosbachstrasse 2, 8032 Zürich

    Tel.: 044 383 70 60
    tel/search

    Help local.ch improve this page
    © 2010 local.ch ag
    © 2010 local.ch ag - Terms of use

    But I need only the name of the bar, the address, and the phone number, and these should be stored in a database.
    Could anyone please help me do this?
     
    Nagadurga, Jul 26, 2010 IP
  3. Nagadurga

    #3
    Please help me.
    I tried another method, but it can only retrieve a single record. I also tried to use the preg_match_all function, but I couldn't get any output. Could anyone please send alternative code for the following, so that I get the needed information?

    <?php
    
    set_time_limit(360);
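    // Returns the text between the first occurrence of $start and the next $end
    // (case-insensitive), wrapped in a one-element array; empty array if not found.
    function extract_unit($string, $start, $end)
    {
        $pos = stripos($string, $start);
        if ($pos === false) {
            return array();
        }
        $from = $pos + strlen($start);
        $second_pos = stripos($string, $end, $from);
        if ($second_pos === false) {
            return array();
        }
        return array(trim(substr($string, $from, $second_pos - $from)));
    }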
    $text=file_get_contents("http://local.ch/en/q/bar.html");
    $text1=extract_unit($text,'<div class="hidden">','</div>');
    $unit[] = extract_unit($text, '<span class="head">', '</span>');
    $unit[] = extract_unit($text,'<span class="street-address">','</span>');
    $unit[] = extract_unit($text,'<span class="postal-code">','</span>');
    $unit[] = extract_unit($text,'<span class="locality">','</span>');
    $unit[] = extract_unit($text,'<span class="label">','</span>');
    $unit[] = extract_unit($text,'<span class="tel">','</span>');
    print_r($unit);
    
    ?>
    
    Thanks in advance...
    please please....
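The code above only looks at the first occurrence of each marker, which would explain why a single record comes back. One way to collect every occurrence is to keep searching from where the previous match ended. The following is a minimal sketch of that idea; extract_all is a hypothetical name, and the sample markup is an assumption rather than the real page source.

```php
<?php
// Collect the trimmed text between every $start ... $end pair in $string
// (case-insensitive). Assumes both markers are non-empty strings.
function extract_all($string, $start, $end)
{
    $units = array();
    $offset = 0;
    while (($pos = stripos($string, $start, $offset)) !== false) {
        $from = $pos + strlen($start);
        $to = stripos($string, $end, $from);
        if ($to === false) {
            break; // opening marker without a closing one: stop searching
        }
        $units[] = trim(substr($string, $from, $to - $from));
        $offset = $to + strlen($end); // continue after this match
    }
    return $units;
}

// Sample markup (an assumption) with two phone entries.
$html = '<span class="tel">022 731 40 18</span>'
      . '<span class="tel"> 031 381 42 60 </span>';
print_r(extract_all($html, '<span class="tel">', '</span>'));
```

Plain string matching like this stops at the first closing marker, so it gives wrong results when the wanted element contains nested tags of the same kind; a DOM parser is more robust in that case.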
     
    Nagadurga, Jul 27, 2010 IP
  4. Nagadurga

    #4
    I still haven't received any reply.
    I tried a different method, and now I get the following output:

    Bar Name
    ADN Bar Café
    Bar Abdelmajid
    Bar Amalfi
    Bar Benjamin (-Gera)
    Bar Bistro Amigos
    Bar Chez Franki
    Bar Croce d'Oro
    Bar Daniela (-Gera)
    Bar Golf
    Bar Gufo

    Address
    rue de Lausanne 59, 1202 Genève
    Könizstrasse 3, 3008 Bern
    Turmstrasse 7, 8330 Pfäffikon ZH
    Im Allmendli 11, 8703 Erlenbach ZH
    Bahnhofstrasse 2, 3360 Herzogenbuchsee
    rue Victor-Tissot 4, 1630 Bulle
    via Motta 3, 6900 Lugano
    Im Allmendli 11, 8703 Erlenbach ZH
    via della Posta 2, 6900 Lugano
    via Girella, 6814 Lamone

    Contact
    022 731 40 18
    031 381 42 60
    043 535 90 05
    076 232 23 21
    062 961 01 10
    076 332 10 34
    091 921 47 93
    079 780 65 54
    091 921 39 03
    091 967 17 36


    How can I save this in a MySQL database?
    Could anyone please leave a suggestion for me?
    Thanks in advance.
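Since the output consists of three parallel lists (names, addresses, phone numbers), one possible way to store it is to combine the lists into one row per bar and insert each row with a PDO prepared statement. The following is a minimal sketch under stated assumptions: the table name bars, its columns, the function names build_rows and save_rows, and the connection details are all hypothetical placeholders.

```php
<?php
// Combine three parallel lists into one row per bar.
// Extra entries in a longer list are ignored.
function build_rows(array $names, array $addresses, array $phones)
{
    $rows = array();
    $count = min(count($names), count($addresses), count($phones));
    for ($i = 0; $i < $count; $i++) {
        $rows[] = array(
            'name'    => $names[$i],
            'address' => $addresses[$i],
            'phone'   => $phones[$i],
        );
    }
    return $rows;
}

// Insert the rows with one prepared statement; returns the number of rows inserted.
// Assumes a table "bars" with columns name, address, phone (hypothetical schema).
function save_rows(PDO $pdo, array $rows)
{
    $stmt = $pdo->prepare('INSERT INTO bars (name, address, phone) VALUES (?, ?, ?)');
    $saved = 0;
    foreach ($rows as $row) {
        $stmt->execute(array($row['name'], $row['address'], $row['phone']));
        $saved++;
    }
    return $saved;
}

// Hypothetical usage; host, database name, user, and password are placeholders:
// $pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8mb4', 'user', 'secret');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// $pdo->exec('CREATE TABLE IF NOT EXISTS bars (
//     id INT AUTO_INCREMENT PRIMARY KEY,
//     name VARCHAR(255), address VARCHAR(255), phone VARCHAR(32))');
// echo save_rows($pdo, build_rows($names, $addresses, $phones));
```

Prepared statements keep names such as "Bar Croce d'Oro" from breaking the query, because the apostrophe is passed as data rather than as part of the SQL text.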
     
    Nagadurga, Jul 29, 2010 IP