Building a web site crawler

Discussion in 'PHP' started by Personaltrainer, May 10, 2007.

  1. #1
    Hi,
    We are in the process of building a cutomised site crawlers. We are quiet successful in building one. But I have a question for the expert coders. Is it possible to fetch last modified data of a page from anywhere if so how is it done?
     
    Personaltrainer, May 10, 2007 IP
  2. markhere

    #2
    Most hosts send a Last-Modified header with their responses; you could check that one.
     
    markhere, May 10, 2007 IP
  3. nico_swd

    #3
    Assuming you have PHP 5:
    
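    function filemtime_remote($url, $timestamp = false)
    {
    	// get_headers() returns false on failure, so check before looping
    	$headers = get_headers($url);

    	if ($headers === false)
    	{
    		return false;
    	}

    	foreach ($headers AS $header)
    	{
    		// Header names are case-insensitive, hence the /i flag
    		if (preg_match('/^Last-Modified:\s*(.+)/i', trim($header), $date))
    		{
    			// Return a Unix timestamp if requested, otherwise the raw date string
    			return $timestamp ? strtotime($date[1]) : $date[1];
    		}
    	}
    	
    	// The server did not send a Last-Modified header
    	return false;
    }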
    
    PHP:
     
    nico_swd, May 10, 2007 IP
  4. Personaltrainer

    #4
    Thank you Nico. It is working fine with HTML pages and images, but not with PHP pages. Is there anything else I could try?
     
    Personaltrainer, May 10, 2007 IP
  5. nico_swd

    #5
    Not sure. Maybe PHP pages don't send a "Last-Modified" header because they are generated dynamically, so in effect their last modification is whenever they were last requested. Not sure if there's a way to achieve this...

    EDIT:

    What you could do is save an md5 hash of the page's source code in a database and compare it to the previously saved hash on each crawl. If it's different, you know the page was modified some time between the two crawls, which tells you more or less when it was last changed.
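
    A minimal sketch of that hash comparison, in case it helps. The function name page_changed() and the $store array are made up for illustration; $store just stands in for whatever database table you would really use. A URL that has never been seen before is treated as changed.

```php
<?php
// Hypothetical helper (not a built-in): reports whether $content differs
// from what was stored for $url on the previous crawl.
// $store is an in-memory stand-in for a database table, keyed by URL.
function page_changed(array &$store, $url, $content, $now = null)
{
    // Allow the caller to pass a timestamp; default to the current time
    $now  = ($now === null) ? time() : $now;
    $hash = md5($content);

    // Same hash as last time: the page has not changed since then
    if (isset($store[$url]) && $store[$url]['hash'] === $hash) {
        return false;
    }

    // New or modified page: remember the hash and when the change was noticed
    $store[$url] = array('hash' => $hash, 'seen' => $now);

    return true;
}
```

    In the crawler you would pass in the body you already fetched (for example with file_get_contents($url)). The 'seen' value is only the time the crawler noticed the change, so it is accurate to within your crawl interval at best. Also keep in mind that pages which print the current time, ads or session IDs will hash differently on every request.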
     
    nico_swd, May 10, 2007 IP