Help with scraper for spider

Discussion in 'PHP' started by Joeker, Aug 21, 2008.

  1. #1
    I'm working on a simple search engine, but I am having a problem with the scraper I wrote.

    All I want to do is fetch each page with file_get_contents and insert the contents of its title tag into my database.

    The problem is that, since it's scraping so many pages, the script is timing out or not scraping properly.

    Here's my code:

    
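    <?php
    
    set_time_limit(0); // remove PHP's execution time limit for this long-running loop
    
    $dbhost = "localhost"; 
    $dbuser = "*****";
    $dbpass = "*****";
    $dbname = "*****";
    
    // mysqli replaces the mysql_* functions, which were removed in PHP 7
    $db = mysqli_connect($dbhost, $dbuser, $dbpass, $dbname);
    if (!$db) {
        die("Connection failed: " . mysqli_connect_error());
    }
    $stmt = mysqli_prepare($db, "INSERT INTO spider (title) VALUES (?)");
    
    // Return the text between the first $start and the next $end, or null if either is missing
    function get($text, $start, $end)
    {
        $parts = explode($start, $text, 2);
        if (count($parts) < 2) {
            return null;
        }
        $rest = explode($end, $parts[1], 2);
        return count($rest) < 2 ? null : $rest[0];
    }
    
    for ($i = 1; $i <= 1000000; $i++) 
    {
        $content = @file_get_contents("http://www.website.com/page.php?id=$i");
        if ($content === false) {
            continue; // request failed; skip this page
        }
    
        $title = get($content, "<title>", "</title>");
        if ($title === null) {
            continue; // page has no <title> element
        }
    
        // Bound parameter, so quotes in a title cannot break the query
        mysqli_stmt_bind_param($stmt, "s", $title);
        mysqli_stmt_execute($stmt);
    }
    
    ?>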
    
    Anyone?
     
    Joeker, Aug 21, 2008 IP
  2. Louis11

    #2
    The runtime on that is going to be nuts :p

    In either case, why not use a regular expression . . . off the top of my head something like:

    
    // preg_match() returns 0 or 1; the title text is captured in $matches[1]
    $title = preg_match("/<title>(.*?)<\/title>/s", $content, $matches) ? $matches[1] : "";
    
    I'm sure the regex could be improved but you get the idea.

    Also, what do you mean when you say it's "not scraping properly" . . . If you mean it's not grabbing the title, I'm going to bet it's that get function you wrote.

    You may also consider upping the script time limit . . . looping through that many times then calling another function is gonna take a while.
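
    On the runtime point: set_time_limit(0) in the original script already removes PHP's own execution limit, but each slow or unresponsive page can still hold up the loop for the default socket timeout (60 seconds unless configured otherwise). A minimal sketch of a fetch with a shorter per-request timeout follows. The helper name fetch_page and the 5-second default are assumptions for illustration, not part of the original script.

    ```php
    <?php
    // Hypothetical helper: fetch a URL with a per-request timeout.
    // Returns the body as a string, or null if the request fails.
    function fetch_page($url, $timeoutSeconds = 5.0)
    {
        // The "timeout" option of the http stream context is a read timeout in seconds.
        $context = stream_context_create(array(
            "http" => array("timeout" => $timeoutSeconds),
        ));

        // @ suppresses the warning that file_get_contents raises on failure.
        $body = @file_get_contents($url, false, $context);

        return $body === false ? null : $body;
    }
    ```

    Inside the loop, a null result can simply be skipped with continue, so one dead page does not stop the rest of the run.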
     
    Louis11, Aug 21, 2008 IP
  3. Freewebspace

    #3
    Does this also hold good for uppercase tags such as

    <TITLE> </TITLE>?
     
    Freewebspace, Aug 21, 2008 IP
  4. live-cms_com

    #4
    What I do is have the PHP file do a header redirect to itself after each chunk of work, so you don't hit a server timeout. If you do this, though, you have to disable Firefox's redirect loop security in about:config.
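
    The self-redirect idea could be sketched roughly as below: each request handles one batch of ids and then redirects to itself with the next starting id. The helper next_batch_url, the start query parameter, and the batch size of 100 are all assumptions made for illustration; only the upper bound of 1000000 comes from the original loop.

    ```php
    <?php
    // Hypothetical helper: returns the URL for the next batch, or null when finished.
    function next_batch_url($script, $start, $batchSize, $lastId)
    {
        $next = $start + $batchSize;
        if ($next > $lastId) {
            return null; // nothing left to scrape
        }
        return $script . "?start=" . $next;
    }

    // Only redirect when served over HTTP, not when run from the command line.
    if (PHP_SAPI !== "cli") {
        $start = isset($_GET["start"]) ? max(1, (int) $_GET["start"]) : 1;
        $batchSize = 100;   // assumed batch size
        $lastId = 1000000;  // upper bound taken from the original loop

        // ... scrape ids $start through min($start + $batchSize - 1, $lastId) here ...

        $next = next_batch_url($_SERVER["PHP_SELF"], $start, $batchSize, $lastId);
        if ($next !== null) {
            header("Location: " . $next); // must be sent before any other output
            exit;
        }
        echo "Done";
    }
    ```

    Because every batch is a separate request, no single request runs long enough to time out. The cost is a long chain of redirects, which is why the browser's redirect limit gets in the way.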
     
    live-cms_com, Aug 21, 2008 IP
  5. Louis11

    #5
    I'm a little late, but there is a case-insensitive modifier in regular expressions. If my memory doesn't fail me, I think it's just an "i" placed after the closing delimiter. Other than that, I do believe the code should work as provided.
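
    A minimal sketch of the regex approach with the "i" modifier applied, so that both <title> and <TITLE> match. The function name extract_title is hypothetical, and the "s" modifier and the lazy (.*?) group are additions here to cope with titles that span several lines.

    ```php
    <?php
    // Hypothetical helper: returns the trimmed <title> text, or null if none is found.
    function extract_title($html)
    {
        // (.*?) captures lazily; "i" ignores case; "s" lets "." match newlines.
        if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $matches)) {
            return trim($matches[1]);
        }
        return null;
    }

    var_dump(extract_title("<TITLE> My Page </TITLE>"));
    var_dump(extract_title("<p>no title</p>"));
    ```

    The first call returns "My Page" and the second returns null, so the caller can tell a missing title apart from an empty one.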
     
    Louis11, Aug 26, 2008 IP