About a web crawler in PHP

Discussion in 'PHP' started by shilpi, Apr 15, 2010.

  1. #1
    Hi all,

    I am implementing a web crawler in PHP. I have already crawled the URLs, and now I have to extract keywords from the pages at those URLs and save them in my database.
    To extract the keywords, I want to remove all CSS code, JavaScript and URLs from each page and save the remaining information as tokens in my database.
    Can anybody help me, or give me a script that does this?

    I am running short of time to submit my project, so I would be very thankful if anybody could help me soon.
    Thanks a lot.
     
    shilpi, Apr 15, 2010 IP
  2. skywebsol

    skywebsol Well-Known Member

    #2
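    $url = 'http://www.website.com';
    $tele = curl_init();
    curl_setopt($tele, CURLOPT_HEADER, 0);          // keep response headers out of the output
    curl_setopt($tele, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($tele, CURLOPT_URL, $url);
    $response = curl_exec($tele); // here you get all the HTML code (or false on failure)
    curl_close($tele);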

    Then search the HTML for the tags around the value you want, like below:

    $offset = 0; // start at the beginning; strpos() returns false if nothing is found, so check that in real code
    $offset = strpos($response, '<div class="title">', $offset);
    $offset = strpos($response, '<h3>', $offset);
    $start = strpos($response, 'href="', $offset) + strlen('href="');
    $end = strpos($response, '">', $start);
    $link = substr($response, $start, $end - $start);
    $link = strip_tags($link); // here you get your value (keyword); then store it in your database

    If you need more, PM me.
     
    skywebsol, Apr 15, 2010 IP
  3. georgiivanov

    georgiivanov Member

    #3
    georgiivanov, Apr 15, 2010 IP
  4. ps3ubo

    ps3ubo Peon

    #4
    ps3ubo, Apr 15, 2010 IP
  5. shilpi

    shilpi Peon

    #5
    Hi, thanks to all for the replies, but I am still having trouble with this problem.

    When I use the strip_tags function from http://php.net/manual/en/function.strip-tags.php, it works perfectly on a small piece of HTML, but when I load a whole web page, for example
    $s = file_get_contents("http://www.google.co.in"); then the output still contains most of that page's CSS code.
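
A likely explanation for the leftover CSS: strip_tags() removes the tags themselves but keeps the text between them, so whatever sits inside <style> and <script> survives as plain text. Below is a minimal sketch of one workaround, which deletes those blocks with a regular expression before calling strip_tags(). The function name stripPageToText is hypothetical, and a regex like this can miss malformed markup, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: plain text of a page without CSS or JavaScript.
function stripPageToText(string $html): string
{
    // Remove <script>...</script> and <style>...</style> together with their contents.
    // preg_replace() returns null on error, so fall back to the unchanged input.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;

    // Now strip_tags() only has ordinary markup left to remove.
    $text = strip_tags($html);

    // Turn entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

$sample = '<style>body { margin: 0; }</style><p>PHP crawler</p><script>alert(1);</script>';
echo stripPageToText($sample), "\n"; // prints "PHP crawler"
```

In a crawler, $sample would be replaced by the HTML returned from file_get_contents() or cURL.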
     
    shilpi, Apr 16, 2010 IP
  6. shilpi

    shilpi Peon

    #6
    Hi,

    Thank you for the reply, but the problem is that when I use my web page, say
    $url = 'http://images.google.co.in/imghp?hl=en&tab=wi'; most of the output is CSS code, and I cannot find the keywords that tell what the web page is about.
    Saving the CSS code or JavaScript in the database is just a waste of space.
    I want to extract the keywords that convey the information of the web page, not the design or code parts.

    Please correct me if my concept is wrong or if I am not using this code properly, as I am working with PHP for the first time.
    Thanks a lot.
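
Once the page has been reduced to plain text, the keyword step described above could be sketched roughly as follows: drop URLs, lowercase, split into tokens, skip stop words and count what is left. The function name extractKeywords, the stop-word list and the minimum length of 3 are all assumptions for illustration, and the split is ASCII-only, so non-English pages would need a different pattern.

```php
<?php
// Hypothetical helper: turns plain text into keyword => count pairs.
function extractKeywords(string $text, array $stopWords, int $minLength = 3): array
{
    // Remove anything that looks like a URL before tokenizing.
    $text = preg_replace('#\b(?:https?://|www\.)\S+#i', ' ', $text) ?? $text;
    $text = strtolower($text);

    // Split on every run of characters that is not an ASCII letter or digit.
    $tokens = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $counts = [];
    foreach ($tokens as $token) {
        // Skip very short tokens and common words that carry little meaning.
        if (strlen($token) < $minLength || in_array($token, $stopWords, true)) {
            continue;
        }
        $counts[$token] = ($counts[$token] ?? 0) + 1;
    }

    arsort($counts); // most frequent keywords first
    return $counts;
}

$stopWords = ['the', 'and']; // illustrative list only
$text = 'PHP crawler: the crawler visits http://example.com/page and the pages';
print_r(extractKeywords($text, $stopWords));
```

In this sample "crawler" is counted twice, while the URL and the stop words are dropped. Each keyword/count pair could then be written to the database, ideally with a prepared statement.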
     
    shilpi, Apr 16, 2010 IP
  7. shilpi

    shilpi Peon

    #7
    Could anybody please help me?
     
    shilpi, Apr 17, 2010 IP
  8. frank100

    frank100 Peon

    #8
    There is a whole book written on this topic...
    webcrawler and .....

    Get it.
     
    frank100, Apr 17, 2010 IP
  9. desiattitude

    desiattitude Peon

    #9
    hmmm... true true
     
    desiattitude, Apr 18, 2010 IP
  10. WebtoolMaster

    WebtoolMaster Active Member

    #10
    Code worked...
     
    WebtoolMaster, Jun 5, 2010 IP
  11. roopajyothi

    roopajyothi Active Member

    #11
    You can use strip_tags to remove that,
    or else use DOM to extract only the data you need!
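
The DOM route mentioned above might look something like this sketch, which uses PHP's bundled DOMDocument and DOMXPath classes to delete <script>, <style> and <noscript> elements and then read the remaining text. The function name extractVisibleText is hypothetical, and character-encoding handling is left out, so pages that are not plain ASCII may need extra care.

```php
<?php
// Hypothetical helper: visible text of an HTML page, read through the DOM.
function extractVisibleText(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array first, so removing nodes does not disturb the loop.
    $unwanted = iterator_to_array($xpath->query('//script | //style | //noscript'));
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // textContent joins all remaining text nodes; collapse whitespace afterwards.
    return trim(preg_replace('/\s+/', ' ', $root->textContent) ?? '');
}

$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>';
echo extractVisibleText($html), "\n"; // prints "Hello world"
```

Compared with plain strip_tags(), this tolerates messy markup better, and the same DOMXPath object can also pull out specific elements such as the title or headings.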
     
    roopajyothi, Jun 5, 2010 IP