PHP scraping lag...

Discussion in 'PHP' started by mark_s, Jun 20, 2007.

  1. exodus

    exodus Well-Known Member

    #21
    Also, no one mentioned using cURL instead of file_get_contents. file_get_contents fetches the page more slowly (not noticeably if you run a cache system).

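    function getweb($url)
    {
          // Prefer cURL when the extension is loaded, otherwise fall back to file_get_contents
          if (function_exists('curl_init')) {
              $ch = curl_init($url);
              // Return the response as a string instead of printing it
              curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
              // Follow HTTP redirects
              curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
              return curl_exec($ch);
          } else {
              return file_get_contents($url);
          }
    }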
    Code (markup):

     
    exodus, Jun 21, 2007 IP
  2. mark_s

    mark_s Peon

    #22
    A friend pointed out that the rankings, depending on where the matches are being played, can sometimes be updated hours later than expected.

    Is it possible to make the code check multiple times on Monday? For example, Monday 6AM, then Monday 8AM, and finally Monday 10AM?
     
    mark_s, Jun 22, 2007 IP
  3. PenSniffer

    PenSniffer Guest

    #23
    In addition, I would add that part of the page as an iframe, if it is at the top of the page, for the times that you do load from the server.
     
    PenSniffer, Jun 22, 2007 IP
  4. mark_s

    mark_s Peon

    #24
    That's one way to do it but I really don't like iframes so I won't do that.
     
    mark_s, Jun 22, 2007 IP
  5. mark_s

    mark_s Peon

    #25
    I'm using the code and it's great, works fine.

    Can anyone look at this and tell me whether it can be set so that on Monday it checks at 1-hour intervals from a set time for several hours? For example, set the time to 6AM and check every hour after that until 11AM.

    <?php
    /**
    * The number of days you wanna keep the cache on disk before retrieving a new copy from CACHE_URL
    **/ define( "CACHE_DAYS",         7 ) ;
    
    /**
    * The location on your server where you wanna keep the cache AND its filename ( putting a . before it will keep it hidden )
    **/ define( "CACHE_LOCATION",      'rank_cache.disk' ) ;
    
    /**
    * If the file you're getting is particularly large, you might wanna squash it with gzcompression
    **/ define( "CACHE_GZ",    0 ) ;
    
    /**
    * The url to cache
    **/ define( "CACHE_URL",            'http://www.atptennis.com/3/en/players/playerprofiles/?playernumber=MC10' );
    
    /**
    * If you change ANY settings, you MUST set this to one to create a new valid cache
    **/ define( "CACHE_FORCE",      0 ) ;
    
    class cache
    {
        function cache( )
        {
            if( !file_exists( CACHE_LOCATION ) and !fopen( CACHE_LOCATION, 'w+' ) )
                die( sprintf( "Make sure the server has permission to write to '%s'", CACHE_LOCATION ) );         
            elseif( !$this->isvalid( ) or CACHE_FORCE )
                $this->retrieve( );
           
            if( !( $this->disk = file_get_contents( CACHE_LOCATION ) ) )
                die( sprintf( "Cannot read cache from '%s' into memory", CACHE_LOCATION ) );   
    
        }
        function time( )
        {
            return time( ) - ( CACHE_DAYS * 3600 * 24 );
        }
        function get( )
        {
            return CACHE_GZ ? gzuncompress( $this->disk ) : $this->disk ;
        }
        function isvalid( )
        {
            return ( filemtime( CACHE_LOCATION ) > $this->time( ) );
        }
        function retrieve( $size = 4096 )
        {
            if( !( $read = fopen( CACHE_URL, 'r' ) ) )
            {
                die( sprintf( "Cannot open '%s' for reading", CACHE_URL ) );   
            }
            elseif( !( $write = fopen( CACHE_LOCATION, 'w+' ) ) )
            {
                die( sprintf( "Cannot open '%s' for writing", CACHE_LOCATION ) );   
            }
            else
            {
                while( !feof( $read ) )
                {
                    $buffer[ ] = fgets( $read, $size );
                }
                if( !fwrite( $write, CACHE_GZ ? gzcompress( implode( '', $buffer ) ) : implode( '', $buffer ) ) )
                {
                    die( sprintf( "Cannot write to '%s'", CACHE_LOCATION ) );   
                }
                fclose( $read );
                fclose( $write );
                return true ;
            }
        }
    }
    /**
    * An example of how you might implement such a cache
    **/
    class scrape extends cache
    {
        /**
        * Construct scraper
        **/
        function scrape( )
        {
            /**
            * Construct the cache
            **/
            parent::cache(  );
        }
        /**
        * Operate on cache
        **/
        function operation( $regex, $index = null )
        {
            preg_match( $regex, parent::get( ), $matches );
            return $index ? $matches[ $index ] : $matches;
        }
    }
    /**
    * Create new scraper
    **/
    $scrape = new scrape( );
    /**
    * Operate on cache
    **/
    echo 'Entry: <span class="rank_green">';
    echo $scrape->operation( '/<div class="data">Current ATP Ranking - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/', 1 );
    echo '</span>&nbsp; ';
    
    echo 'Race: <span class="rank_green">';
    echo $scrape->operation( '/<div class="data">Current ATP Race - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/', 1 );
    echo '</span>';
    ?>
    
      
    Code (markup):
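    For anyone following along, here is a minimal sketch of what operation() does with the first pattern. The HTML fragment is a made-up sample shaped to fit the regex, not a copy of the real ATP page, and the rank 12 is only a placeholder.

```php
<?php
// Hypothetical HTML fragment, written to fit the pattern; the real page may differ.
$html = '<div class="data">Current ATP Ranking - Singles:</div>' . "\n"
      . '</td>' . "\n"
      . '<td valign="middle"><span class="lines"> 12 </span>';

// Same pattern as the "Entry" line in the script above.
$regex = '/<div class="data">Current ATP Ranking - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/';

// preg_match returns 1 on a match; $matches[1] then holds the first capture group.
$rank = preg_match($regex, $html, $matches) ? $matches[1] : null;

echo $rank === null ? "no match\n" : "rank: $rank\n";
```

    If the page layout ever changes, preg_match returns 0, so checking its return value before reading $matches[1] avoids an undefined-index notice.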
     
    mark_s, Jun 22, 2007 IP
  6. UnrealEd

    UnrealEd Peon

    #26
    Add these constants at the top of your script:
    /**
    * Defines the start hour at which you want to update the cached file
    */ define('CACHE_HOUR_START', 6);
    
    /**
    * Defines the end hour at which you want to update the cached file
    */ define('CACHE_HOUR_END', 11);
    PHP:
    and replace the isvalid function with this one:
    function isvalid( )
        {
            return ( filemtime( CACHE_LOCATION ) > $this->time( ) && (date('H') >= CACHE_HOUR_START && date('H') <= CACHE_HOUR_END) );
        }
    PHP:
    i'm not 100% sure if it's gonna work, but i think it will. Just give it a try
     
    UnrealEd, Jun 23, 2007 IP
  7. mark_s

    mark_s Peon

    #27
    Thanks so much UnrealEd :)

    1) The code you gave me is now forcing it to refresh every time.

    2) When the code works, will that mean it only refreshes each hour, or will it refresh on every visit in the given time frame? Obviously I hope it only does it each hour.
     
    mark_s, Jun 23, 2007 IP
  8. UnrealEd

    UnrealEd Peon

    #28
    it will do it on each refresh when the hour is between 6 and 11. If you want it to refresh every hour, you're gonna have a problem because it's gonna be very, very hard to have a visitor exactly at 6:00 or 7:00, and so on till 11:00

    btw: i think i made a mistake in the isvalid function. It should be like this:
    function isvalid( )
        {
            return ( filemtime( CACHE_LOCATION ) > $this->time( ) || (date('H') >= CACHE_HOUR_START && date('H') <= CACHE_HOUR_END) );
        }
    PHP:
    Otherwise it won't update the cache if the cache is older than 7 days and the time is not between 6 and 11 AM.
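    To see how the two versions behave, it can help to pull the condition out into plain functions that take the cache freshness and the hour as arguments. This is only a sketch: the function names are made up, and 6 and 11 stand in for CACHE_HOUR_START and CACHE_HOUR_END.

```php
<?php
// Hypothetical helpers mirroring the two isvalid() variants from this thread.
// $fresh: true when the cache file is newer than CACHE_DAYS.
// $hour:  current hour of the day (0-23), as date('H') would give it.

function isvalid_and($fresh, $hour, $start = 6, $end = 11)
{
    // First variant: fresh AND inside the hour window.
    return $fresh && ($hour >= $start && $hour <= $end);
}

function isvalid_or($fresh, $hour, $start = 6, $end = 11)
{
    // Second variant: fresh OR inside the hour window.
    return $fresh || ($hour >= $start && $hour <= $end);
}

// Print one row per combination of fresh/stale and inside/outside the window.
foreach (array(array(true, 8), array(true, 15), array(false, 8), array(false, 15)) as $case) {
    list($fresh, $hour) = $case;
    printf(
        "fresh=%s hour=%02d  and=%s  or=%s\n",
        var_export($fresh, true),
        $hour,
        var_export(isvalid_and($fresh, $hour), true),
        var_export(isvalid_or($fresh, $hour), true)
    );
}
```

    Remember that the constructor fetches a new copy whenever isvalid() returns false. With the first variant a fresh cache counts as invalid at 15:00, so every visit outside the window triggers a fetch. With the second, a stale cache counts as valid at 08:00, so nothing is fetched inside the window.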
     
    UnrealEd, Jun 23, 2007 IP
  9. mark_s

    mark_s Peon

    #29
    Why can't it use the same method the code initially used? I was told that it's the first visitor on or after that time who forces the cache to refresh. So why not have that for each hour?
     
    mark_s, Jun 23, 2007 IP
  10. UnrealEd

    UnrealEd Peon

    #30
    The old method simply checks whether the time the file was last modified (filemtime) is more than 3600*24*7 seconds ago (filemtime is in seconds). In other words, it only refreshes when the file hasn't been updated in 7 days. As soon as a visitor arrives after the 7th day, the cache is updated, and from that moment the file will not be updated again for another 7 days.

    It never checked at what time of day the file was updated. So if the cache happens to expire shortly before the site's update (which only happens between 6 and 11 AM), the first visitor refreshes it with the previous week's data, and that stale copy is then kept for another 7 days.
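    As a rough illustration of that 7-day comparison, here it is with the timestamps passed in, so it can be tried without a cache file. is_fresh() is a made-up name and the example date is arbitrary.

```php
<?php
// Sketch of the original age check: the file must be newer than "now minus $days days".
function is_fresh($mtime, $now, $days = 7)
{
    return $mtime > $now - ($days * 3600 * 24);
}

$now            = mktime(12, 0, 0, 6, 25, 2007); // an arbitrary example moment
$six_days_ago   = $now - 6 * 86400;
$eight_days_ago = $now - 8 * 86400;

var_dump(is_fresh($six_days_ago, $now));   // bool(true): cache is kept
var_dump(is_fresh($eight_days_ago, $now)); // bool(false): cache is refetched
```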

    Maybe this will work:
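    function isvalid( )
        {
            $mtime = filemtime( CACHE_LOCATION );
            // invalid when older than CACHE_DAYS, or when inside the hour window and last written over an hour ago
            return ( $mtime > $this->time( ) && !( date('H') >= CACHE_HOUR_START && date('H') <= CACHE_HOUR_END && $mtime < time( ) - 3600 ) );
        }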
    PHP:
    Again i'm not 100% sure it's gonna work. I don't have a test server here :(
    I think it will: now it checks whether (the filemtime is older than 7 days) or (the current hour is between 6 and 11 AM and the file was last changed at least 1 hour ago)

    I just thought of something: filemtime returns the time the file's contents were last written, which is what we want here, because retrieve() rewrites the file on every update. Don't swap it for filectime: on a UNIX machine filectime is the inode change time (it also moves when permissions or ownership change), and on Windows it returns the creation time. So filemtime should behave the same on either kind of server.
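    Putting the pieces together, here is one way the hourly check could look as a standalone function. It is a sketch under assumptions, not tested against the script above: cache_is_valid() is a made-up name, the Monday test via date('N') is added because the original question asked for Mondays only, the UTC timezone is a placeholder, and the default window of 6 to 11 mirrors CACHE_HOUR_START and CACHE_HOUR_END.

```php
<?php
date_default_timezone_set('UTC'); // assumption: use whichever zone the rankings update in

// Hypothetical helper: is a cache last written at $mtime still usable at $now?
function cache_is_valid($mtime, $now, $days = 7, $start = 6, $end = 11)
{
    // Older than $days days: always stale.
    if ($mtime <= $now - $days * 86400) {
        return false;
    }
    // Monday (ISO day 1) inside the hour window: stale once the copy is over an hour old.
    $hour      = (int) date('G', $now);
    $in_window = date('N', $now) == 1 && $hour >= $start && $hour <= $end;
    if ($in_window && $mtime < $now - 3600) {
        return false;
    }
    return true;
}

$monday_8am = mktime(8, 0, 0, 6, 25, 2007); // 25 June 2007 was a Monday

var_dump(cache_is_valid($monday_8am - 2 * 3600, $monday_8am));         // bool(false): refetch
var_dump(cache_is_valid($monday_8am - 1800, $monday_8am));             // bool(true): fetched 30 minutes ago
var_dump(cache_is_valid($monday_8am - 2 * 3600, $monday_8am + 86400)); // bool(true): it is Tuesday
```

    Because the check only runs when someone loads the page, the refresh happens on the first visit after the copy is more than an hour old, not exactly on the hour. That is the same "first visitor forces the cache" behaviour as the original 7-day check.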
     
    UnrealEd, Jun 23, 2007 IP
  11. mark_s

    mark_s Peon

    #31
    Mine is just a typical Linux VPS :-k

    Thanks for the new code, will be great if it works.
     
    mark_s, Jun 23, 2007 IP