Hi all, I am implementing a web crawler in PHP. I have already crawled the URLs, and now I need to extract keywords from the pages at those URLs and save them in my database. To extract the keywords, I want to remove all CSS, JavaScript, and URLs from each page and save the remaining text as tokens in my database. Can anybody help me, or share a script that does this? I am running short of time before my project submission, so I would be very thankful for any quick help. Thanks a lot.
$url = 'http://www.website.com';
$tele = curl_init();
curl_setopt($tele, CURLOPT_HEADER, 0);
curl_setopt($tele, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($tele, CURLOPT_URL, $url);
$response = curl_exec($tele); // $response now holds the full HTML of the page

// Then search the HTML for the tags around the value you want, like below.
// Note: strpos() returns false when the text is not found, so check for that on real pages.
$offset = 0; // start searching at the beginning of the page
$offset = strpos($response, '<div class="title">', $offset);
$offset = strpos($response, '<h3>', $offset);
$start = strpos($response, 'href="', $offset) + strlen('href="');
$end = strpos($response, '">', $start);
$link = substr($response, $start, $end - $start);
$link = strip_tags($link);
// $link now holds your value (or keyword); store it in your database.

If you need more, PM me.
I use PHP web crawlers for my site, http://www.mp3drug.com/. It's very easy to do once you get the hang of it.
Hi, thanks all for the replies, but I am still having trouble with this problem. When I use the strip_tags function from http://php.net/manual/en/function.strip-tags.php, it works perfectly on a small piece of HTML, but when I fetch a whole web page, for example $s = file_get_contents("http://www.google.co.in"), the output is mostly that page's CSS code.
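This behaviour is expected: strip_tags() removes only the tags themselves and keeps the text between them, so the contents of <style> and <script> elements come through as plain text. Below is a minimal sketch of one possible workaround, which deletes those elements together with their contents before calling strip_tags(). The function name html_to_text and the sample markup are made up for illustration, and a regex approach like this can be fooled by unusual or broken markup.

```php
<?php
// Hypothetical helper (not from the thread): turn an HTML page into plain text.
function html_to_text(string $html): string
{
    // Remove <script> and <style> elements including everything inside them.
    // Flags: i = case-insensitive, s = let "." match newlines.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);

    // Remove HTML comments as well.
    $html = preg_replace('/<!--.*?-->/s', ' ', $html);

    // Put a space before every tag so words from adjacent elements do not merge.
    $html = str_replace('<', ' <', $html);

    // Now only ordinary tags are left, so strip_tags() does what we want.
    $text = strip_tags($html);

    // Convert entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page with CSS and JavaScript in the head.
$sample = '<html><head><style>body { color: red; }</style>'
        . '<script>var x = 1;</script></head>'
        . '<body><h1>Hello</h1><p>World &amp; more</p></body></html>';

echo html_to_text($sample), "\n"; // prints "Hello World & more"
```

For a real page, the string returned by file_get_contents() or curl_exec() would be passed in place of $sample.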
Hi, thank you for the reply, but the problem is that when I fetch my web page, say $url = 'http://images.google.co.in/imghp?hl=en&tab=wi'; most of the output is CSS code, and I cannot find any keywords in it that describe what the page is about. Saving CSS or JavaScript into the database is just a waste of space. I want to extract the keywords that convey the information on the page, not its design or code. Please correct me if my understanding is wrong or if I am not using this code properly, as I am working with PHP for the first time. Thanks a lot.
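Once the page has been reduced to plain text, one simple way to get keywords is to drop the URLs, split the text into tokens, and count how often each token occurs. The following is only a rough sketch under some assumptions: the function name extract_keywords, the tiny stop-word list, and the minimum length of 3 characters are all invented for illustration, and the splitting only handles plain ASCII (English) text.

```php
<?php
// Hypothetical helper (not from the thread): count candidate keywords in plain text.
function extract_keywords(string $text, int $limit = 10): array
{
    // Drop URLs so they are not stored as tokens.
    $text = preg_replace('#\b(?:https?://|www\.)\S+#i', ' ', $text);

    // Lowercase, then split on anything that is not an ASCII letter or digit.
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Very small illustrative stop-word list; a real one would be much longer.
    $stopWords = ['the', 'and', 'for', 'are', 'with', 'this', 'that', 'you', 'from', 'was'];

    $counts = [];
    foreach ($tokens as $token) {
        // Skip very short tokens and common words that carry little meaning.
        if (strlen($token) < 3 || in_array($token, $stopWords, true)) {
            continue;
        }
        $counts[$token] = ($counts[$token] ?? 0) + 1;
    }

    // Most frequent tokens first, keeping the token => count mapping.
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

// Made-up sample text standing in for the cleaned page content.
$text = 'PHP crawler tips: a crawler fetches pages, see http://example.com/page '
      . 'for the crawler docs and PHP docs.';

$keywords = extract_keywords($text);
print_r($keywords);
```

The returned array maps each token to its count, so each pair could be saved as one row in the database next to the URL of the page it came from.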