Hi, I am a beginner with PHP. I want to scrape only particular information: for example, if I search for hotels, the script should return a list of hotel names, addresses, phone numbers, emails and fax numbers. I have already used the approach from www.bradino.com/php/php-screen-scraping/, but it only retrieves data that is in table format. I want to extract data from pages that are laid out with CSS instead of tables, specifically from www.local.ch.
Hi friends, I tried the following code to extract the needed information from the web pages:

<?php
class ContentExtractor {
    // Tags treated as content containers that may be pruned.
    var $container_tags = array('span', 'p', 'div');
    // Tags removed outright. These are compared against $node->nodeName, which holds
    // only the tag name, so entries that include attributes never match.
    var $removed_tags = array(
        'div id="hd"', 'meta', 'link', 'title', 'script', 'a href', 'img', 'ul', 'li',
        'form', 'input', 'label', 'strong', 'href', 'noscript', 'iframe', 'h2', 'head',
        'span class="iconKey"'
    );
    var $ignore_len_tags = array('span');
    var $link_text_ratio = 0.04;
    var $min_text_len = 20;
    var $min_words = 0;
    var $total_links = 0;
    var $total_unlinked_words = 0;
    var $total_unlinked_text = '';
    var $text_blocks = 0;
    var $tree = null;
    var $unremoved = array();

    // Normalize the various space characters to a plain space and trim.
    function sanitize_text($text) {
        $text = str_ireplace('&nbsp;', ' ', $text);
        $text = html_entity_decode($text, ENT_QUOTES);
        $utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83", "\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF", "\xA0");
        $text = str_replace($utf_spaces, ' ', $text);
        return trim($text);
    }

    // Parse the HTML, prune unwanted nodes, and return the remaining markup.
    function extract($text, $ratio = null, $min_len = null) {
        $this->tree = new DOMDocument();
        $start = microtime(true);
        if (!@$this->tree->loadHTML($text)) return false;
        $root = $this->tree->documentElement;
        $start = microtime(true);
        $this->HeuristicRemove($root, (($ratio == null) || ($min_len == null)));
        if ($ratio == null) {
            $this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);
            $words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
            $words = array_filter($words);
            $this->total_unlinked_words = count($words);
            unset($words);
            if ($this->total_unlinked_words > 0) {
                $this->link_text_ratio = $this->total_links / $this->total_unlinked_words; // + 0.01;
                $this->link_text_ratio *= 1.3;
            }
        } else {
            $this->link_text_ratio = $ratio;
        }
        if ($min_len == null) {
            // max(1, ...) avoids a division by zero when no text block was found.
            $this->min_text_len = strlen($this->total_unlinked_text) / max(1, $this->text_blocks);
        } else {
            $this->min_text_len = $min_len;
        }
        $start = microtime(true);
        $this->ContainerRemove($root);
        return $this->tree->saveHTML();
    }

    // Remove blacklisted tags; optionally collect link and text statistics.
    function HeuristicRemove($node, $do_stats = false) {
        if (in_array($node->nodeName, $this->removed_tags)) {
            return true;
        }
        if ($do_stats) {
            if ($node->nodeName == 'a') {
                $this->total_links++;
            }
            $found_text = false;
        }
        $nodes_to_remove = array();
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $child) {
                if ($this->HeuristicRemove($child, $do_stats)) {
                    $nodes_to_remove[] = $child;
                } else if ($do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text')) {
                    $this->total_unlinked_text .= $child->wholeText;
                    if (!$found_text) {
                        $this->text_blocks++;
                        $found_text = true;
                    }
                }
            }
            foreach ($nodes_to_remove as $child) {
                $node->removeChild($child);
            }
        }
        return false;
    }

    // Remove containers that are mostly links or contain too little text.
    function ContainerRemove($node) {
        if (is_null($node)) return 0;
        $link_cnt = 0;
        $word_cnt = 0;
        $text_len = 0;
        $delete = false;
        $my_text = '';
        $ratio = 1;
        $nodes_to_remove = array();
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $child) {
                $data = $this->ContainerRemove($child);
                if ($data['delete']) {
                    $nodes_to_remove[] = $child;
                } else {
                    $text_len += $data[2];
                }
                $link_cnt += $data[0];
                if ($child->nodeName == 'a') {
                    $link_cnt++;
                } else {
                    if ($child->nodeName == '#text') $my_text .= $child->wholeText;
                    $word_cnt += $data[1];
                }
            }
            foreach ($nodes_to_remove as $child) {
                $node->removeChild($child);
            }
            $my_text = $this->sanitize_text($my_text);
            $words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
            $words = array_filter($words);
            $word_cnt += count($words);
            $text_len += strlen($my_text);
        }
        if (in_array($node->nodeName, $this->container_tags)) {
            if ($word_cnt > 0) $ratio = $link_cnt / $word_cnt;
            if ($ratio > $this->link_text_ratio) {
                $delete = true;
            }
            if (!in_array($node->nodeName, $this->ignore_len_tags)) {
                if (($text_len < $this->min_text_len) || ($word_cnt < $this->min_words)) {
                    $delete = true;
                }
            }
        }
        return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
    }
}

$html = file_get_contents('http://www.local.ch/en/q/bar.html');
$extractor = new ContentExtractor();
$content = $extractor->extract($html);
echo $content;
?>

But
I got this output:

Results for in the current map areain The number of results indicates how many listings correspond to your search. To view these listings, click the Search button. Do you have any questions or suggestions? Or maybe even come across a problem? Please let us know: info@local.ch Results for You can choose if you want to print the map on this page, by using the options (such as "No Map") which appear directly above the map to display or hide it.
The Yellow Pages > Bar, Restaurant Bleu Lézard rue Enning 10, 1003 Lausanne resultentry_06.63771746.520077Bleu Lézard Bleu Lézard rue Enning 10, 1003 Lausanne Tel.: * 021 321 38 30 tel/search
The Yellow Pages > Bar, Restaurant, Events Nordportal Schmiedestrasse 12, 5400 Baden resultentry_18.30031447.481186Nordportal Nordportal Schmiedestrasse 12, 5400 Baden Tel.: * 056 221 15 72 tel/search
ADN Bar Café rue de Lausanne 59, 1202 Genève resultentry_26.14646746.215079ADN Bar Café ADN Bar Café rue de Lausanne 59, 1202 Genève Tel.: * 022 731 40 18 tel/search
Bar Abdelmajid Könizstrasse 3, 3008 Bern resultentry_37.42168146.944324Bar Abdelmajid Bar Abdelmajid Könizstrasse 3, 3008 Bern Tel.: 031 381 42 60 tel/search
The Yellow Pages > Hotel, Bar, Restaurant Hotel SEDARTIS Bahnhofstrasse 16, 8800 Thalwil resultentry_48.56592247.295528Hotel SEDARTIS Hotel SEDARTIS Bahnhofstrasse 16, 8800 Thalwil Tel.: 043 388 33 00 tel/search
The Yellow Pages > Club, Discotheque, Bar Liquid Genfergasse 10, 3011 Bern resultentry_57.44115946.949633Liquid Liquid Genfergasse 10, 3011 Bern Tel.: * 031 951 98 26 tel/search
Bar Amalfi Spezialitäten aus dem Süden Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH resultentry_68.78198147.368167Bar Amalfi Bar Amalfi Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH Tel.: * 043 535 90 05 tel/search
Bar Benjamin (-Gera) Im Allmendli 11, 8703 Erlenbach ZH resultentry_78.59953547.301096Bar Benjamin (-Gera) Bar Benjamin (-Gera) Im Allmendli 11, 8703 Erlenbach ZH Tel.: * 076 232 23 21 tel/search
The Yellow Pages > Restaurant, Bar Bohemia Klosbachstrasse 2, 8032 Zürich resultentry_98.55496947.364845Bohemia Bohemia Klosbachstrasse 2, 8032 Zürich Tel.: 044 383 70 60 tel/search
Help local.ch improve this page © 2010 local.ch ag © 2010 local.ch ag - Terms of use

But I need only the name of the bar, its address and its phone number, which should then be stored in a database. Could anyone please help me do this?
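One possible way to get only the name, address and phone is to select elements by class name with DOMXPath instead of pruning the whole page. This is only a sketch under assumptions: the span class names (head, street-address, postal-code, locality, tel) are the ones mentioned in this thread for local.ch, while the wrapper div with class "vcard", the sample markup and the function names first_text and extract_listings are hypothetical and would have to be adapted to the real page.

```php
<?php
// Sample markup. The span class names come from this thread; the wrapping
// <div class="vcard"> is an assumption made for illustration.
$html = <<<'HTML'
<html><body>
<div class="vcard">
  <span class="head">Liquid</span>
  <span class="street-address">Genfergasse 10</span>
  <span class="postal-code">3011</span> <span class="locality">Bern</span>
  <span class="tel">031 951 98 26</span>
</div>
<div class="vcard">
  <span class="head">Nordportal</span>
  <span class="street-address">Schmiedestrasse 12</span>
  <span class="postal-code">5400</span> <span class="locality">Baden</span>
  <span class="tel">056 221 15 72</span>
</div>
</body></html>
HTML;

// Return the trimmed text of the first <span> with the given class inside $context.
function first_text(DOMXPath $xpath, DOMNode $context, $class) {
    $query = './/span[contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")]';
    $nodes = $xpath->query($query, $context);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
}

// Build one array per listing with the keys name, address and phone.
function extract_listings($html) {
    libxml_use_internal_errors(true);      // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    // The XML prefix makes loadHTML treat the input as UTF-8 (needed for umlauts).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $entries = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " vcard ")]');

    $listings = array();
    foreach ($entries as $entry) {
        $listings[] = array(
            'name'    => first_text($xpath, $entry, 'head'),
            'address' => first_text($xpath, $entry, 'street-address') . ', '
                       . first_text($xpath, $entry, 'postal-code') . ' '
                       . first_text($xpath, $entry, 'locality'),
            'phone'   => first_text($xpath, $entry, 'tel'),
        );
    }
    return $listings;
}

$listings = extract_listings($html);
print_r($listings);
```

For the live site, $html would come from file_get_contents() as in the code above, and the two XPath expressions would need to match whatever wrapper and class names the page actually uses.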
Please help me. I tried another method, but it can only retrieve a single record. I also tried the preg_match_all function, but I couldn't get any output. Could anyone suggest alternative code to the following that gets the needed information?

<?php
set_time_limit(360);

// Return the text between the first occurrence of $start and the next $end.
// Only the first match is found, so each call yields a single record.
function extract_unit($string, $start, $end) {
    $pos = stripos($string, $start);
    $str = substr($string, $pos);
    $str_two = substr($str, strlen($start));
    $second_pos = stripos($str_two, $end);
    $str_three = substr($str_two, 0, $second_pos);
    $unit[] = trim($str_three);
    return $unit;
}

$text = file_get_contents("http://local.ch/en/q/bar.html");
$text1 = extract_unit($text, '<div class="hidden">', '</div>');
$unit[] = extract_unit($text, '<span class="head">', '</span>');
$unit[] = extract_unit($text, '<span class="street-address">', '</span>');
$unit[] = extract_unit($text, '<span class="postal-code">', '</span>');
$unit[] = extract_unit($text, '<span class="locality">', '</span>');
$unit[] = extract_unit($text, '<span class="label">', '</span>');
$unit[] = extract_unit($text, '<span class="tel">', '</span>');
print_r($unit);
?>

Thanks in advance.
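The single-record limitation comes from searching for the start marker only once. A minimal sketch of a variant that keeps searching from where the previous match ended is shown here; the function name extract_all_units and the sample string are hypothetical, and nested tags inside a span are not handled.

```php
<?php
// Collect the text between EVERY $start ... $end pair, not just the first one.
function extract_all_units($string, $start, $end) {
    $units = array();
    $offset = 0;
    // Look for the next start marker, beginning after the previous match.
    while (($pos = stripos($string, $start, $offset)) !== false) {
        $from = $pos + strlen($start);
        $to = stripos($string, $end, $from);
        if ($to === false) {
            break;                          // unterminated marker: stop
        }
        $units[] = trim(substr($string, $from, $to - $from));
        $offset = $to + strlen($end);       // continue after this match
    }
    return $units;
}

// Hypothetical sample input using class names mentioned in this thread.
$sample = '<span class="head">Liquid</span><span class="tel">031 951 98 26</span>'
        . '<span class="head">Nordportal</span><span class="tel">056 221 15 72</span>';

$names  = extract_all_units($sample, '<span class="head">', '</span>');
$phones = extract_all_units($sample, '<span class="tel">', '</span>');
print_r($names);
print_r($phones);
```

Because the names and phone numbers are returned in page order, entries at the same index can be paired up, provided every listing has both fields.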
I still haven't received any reply. I tried a different method, and now I get this output:

Bar Name | Address | Contact
ADN Bar Café | rue de Lausanne 59, 1202 Genève | 022 731 40 18
Bar Abdelmajid | Könizstrasse 3, 3008 Bern | 031 381 42 60
Bar Amalfi | Turmstrasse 7, 8330 Pfäffikon ZH | 043 535 90 05
Bar Benjamin (-Gera) | Im Allmendli 11, 8703 Erlenbach ZH | 076 232 23 21
Bar Bistro Amigos | Bahnhofstrasse 2, 3360 Herzogenbuchsee | 062 961 01 10
Bar Chez Franki | rue Victor-Tissot 4, 1630 Bulle | 076 332 10 34
Bar Croce d'Oro | via Motta 3, 6900 Lugano | 091 921 47 93
Bar Daniela (-Gera) | Im Allmendli 11, 8703 Erlenbach ZH | 079 780 65 54
Bar Golf | via della Posta 2, 6900 Lugano | 091 921 39 03
Bar Gufo | via Girella, 6814 Lamone | 091 967 17 36

How can I save this in a MySQL database? Any suggestions are welcome. Thanks in advance.
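One common way to store rows like these is a PDO prepared statement executed once per listing. The sketch below is an illustration only: the table name bars, its columns and the function save_listings are assumptions, and it connects to an in-memory SQLite database so that it runs without a server. For MySQL the same code should work with a DSN such as 'mysql:host=localhost;dbname=YOUR_DB;charset=utf8mb4' plus a user name and password passed to the PDO constructor.

```php
<?php
// Insert each listing with a prepared statement; returns the number of rows saved.
// The table "bars" and its columns are hypothetical.
function save_listings(PDO $pdo, array $listings) {
    $stmt = $pdo->prepare(
        'INSERT INTO bars (name, address, phone) VALUES (:name, :address, :phone)'
    );
    $pdo->beginTransaction();               // all rows are saved, or none
    foreach ($listings as $row) {
        $stmt->execute(array(
            ':name'    => $row['name'],
            ':address' => $row['address'],
            ':phone'   => $row['phone'],
        ));
    }
    $pdo->commit();
    return count($listings);
}

// Two rows taken from the output in this thread.
$listings = array(
    array('name' => 'Bar Golf', 'address' => 'via della Posta 2, 6900 Lugano', 'phone' => '091 921 39 03'),
    array('name' => 'Bar Gufo', 'address' => 'via Girella, 6814 Lamone',       'phone' => '091 967 17 36'),
);

// In-memory SQLite stands in for MySQL here (assumption, see the note above).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE bars (name VARCHAR(255), address VARCHAR(255), phone VARCHAR(50))');

$saved = save_listings($pdo, $listings);
$count = (int) $pdo->query('SELECT COUNT(*) FROM bars')->fetchColumn();
echo "Saved $saved rows; the table now holds $count rows\n";
```

Prepared statements keep names such as Bar Croce d'Oro from breaking the query, since the apostrophe is never pasted into the SQL string.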