PHP scraping lag...

Discussion in 'PHP' started by mark_s, Jun 20, 2007.

  1. #1
    I have some PHP scraping code on my website that grabs the latest ranking of a tennis player; however, half the time my site loads 3x slower because of it.

    Is there a way I can avoid this lag? Do I need some advanced cache code or should I use a database to resolve this issue?

    The code I'm using:

    <?php
    $data = file_get_contents('http://www.atptennis.com/3/en/players/playerprofiles/?playernumber=MC10');
    $regex = '/<div class="data">Current ATP Ranking - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/';
    preg_match($regex,$data,$entry);
    echo $entry[1];
    ?>
    
    Code (markup):
     
    mark_s, Jun 20, 2007 IP
  2. jazz7620

    jazz7620 Banned

    Messages:
    357
    Likes Received:
    12
    Best Answers:
    0
    Trophy Points:
    0
    #2
    You should seriously consider caching this. Why? Every time you load the page you are hitting the atptennis.com server. Their webmaster may get annoyed, especially if you have a lot of visitors.

    You can try to set up a cron job that runs this code once or twice a day, stores the result in a local database, and displays it from there.
     
    jazz7620, Jun 20, 2007 IP
  3. UnrealEd

    UnrealEd Peon

    Messages:
    148
    Likes Received:
    7
    Best Answers:
    0
    Trophy Points:
    0
    #3
    You don't even need a cronjob. I assume this site isn't updated every hour, so you can easily store a timestamp in your database (together with the wanted content), and then check in your script if a certain amount of time (let's say a day) has been exceeded, and then grab the contents again from the webpage.
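    The timestamp check described here can be sketched roughly as follows. This is only an illustration: the function name `needsRefresh`, its parameters, and the one-day default are assumptions, not code posted in the thread.

```php
<?php
/**
 * Decide whether stored data is old enough to be fetched again.
 * (Hypothetical helper; the name and the one-day default are assumptions.)
 *
 * @param int|null $lastUpdated Unix timestamp of the last fetch, or null if never fetched
 * @param int      $now         Current Unix timestamp, e.g. time()
 * @param int      $maxAge      Maximum allowed age in seconds
 * @return bool
 */
function needsRefresh($lastUpdated, $now, $maxAge = 86400)
{
    if ($lastUpdated === null) {
        return true; // nothing stored yet, so the page has to be fetched
    }
    // Refresh once the stored copy has reached the maximum age.
    return ($now - $lastUpdated) >= $maxAge;
}
```

    On each page view you would call needsRefresh($storedTimestamp, time()) and only hit the remote site when it returns true.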
     
    UnrealEd, Jun 20, 2007 IP
  4. mark_s

    mark_s Peon

    Messages:
    497
    Likes Received:
    10
    Best Answers:
    0
    Trophy Points:
    0
    #4
    Is there any chance you could give me the code for such a function? Or if that will take too much of your time, maybe some sort of tutorial?

    In fact the way the php scrape could work is for it to only grab the data at a certain time and day in the week. The rankings are updated every Monday morning.

    Thanks for the info, much appreciated.
     
    mark_s, Jun 20, 2007 IP
  5. UnrealEd

    UnrealEd Peon

    #5
    sure np:

    Is the ranking the only thing you want to grab from the website, or do you want some other stuff as well? It matters because you can speed things up if you only need the ranking and nothing else.
    For now I'm gonna assume you only need the ranking.

    What you need to do first is create a new table in your database, named atp_rankings (or something similar). This table should at least have 3 fields: 1 which will contain the name of the player, 1 which will contain the last_updated timestamp, and 1 which will contain the ranking of the player:
    CREATE TABLE `atp_rankings` (
      `player` VARCHAR(255) NOT NULL,
      `date` TIMESTAMP NOT NULL default CURRENT_TIMSTAMP,
      `ranking` INT(4) NOT NULL
    );
    Code (markup):
    Whenever a visitor wants to see the ranking of a certain player, you will have to check when you last updated that player's ranking. Since the site is updated every Monday, you first need to check in PHP if the current day is Monday, and then see if the date in the database is different from today's (sounds crazy :), I'll post some code a little further down). If the date is different, which means you have old data in your database, you need to fetch the webpage. After you have fetched the webpage, you use your regex to get the ranking, and this is what you should store in your database.

    And now in code format:
    $player = "Andy Murray";
    
    $query = "SELECT UNIX_TIMESTAMP(date) AS date FROM atp_rankings WHERE player='" . $player . "'";
    $day = date("N");
    
    if ($day == 1) { // 1 == Monday, check out www.php.net
      $result = mysql_query($query) or die("Database Error: " . mysql_error());
      $db_date = mysql_result($result, 0, "date"); // get the date from the resultset
      if (date("Y-m-d") > date("Y-m-d", $db_date)) {
        // get the ranking here
        $query = "UPDATE atp_rankings SET ranking=" . $ranking . ", date=NOW() WHERE player='" . $player . "'";
        mysql_query($query) or die("Database Error: " . mysql_error());
      }
    }
    
    // now grab the content from the database again to display it to the user
    PHP:
    I hope i helped you a little further :)
    I didn't test the code, and i wrote everything on the spot, so there will probably be some errors

    If you want to display some more information from the webpage to the user, you just have to create some additional fields in the table. If it's a lot you want to display, it might be best to store the <div> container with all the information in 1 field, and use a regex to pull out what you need every time you display it to the visitor.
     
    UnrealEd, Jun 20, 2007 IP
  6. mark_s

    mark_s Peon

    #6
    Wow! Thank you so much for all the code.

    I get this error in phpmyadmin when creating the table:

    #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'CURRENT_TIMSTAMP,
      `ranking` INT(4) NOT NULL
    )' at line 3
    
    Code (markup):
     
    mark_s, Jun 20, 2007 IP
  7. UnrealEd

    UnrealEd Peon

    #7
    a typo: it should be CURRENT_TIMESTAMP instead of CURRENT_TIMSTAMP
     
    UnrealEd, Jun 20, 2007 IP
  8. mark_s

    mark_s Peon

    #8
    Thanks :)

    So I've created the table > put the PHP into rankingscrape.php > included the php on my website.

    How do I now make it display the ranking? And how do I make it so it only checks ATP.com on a Monday at 6AM GMT?
     
    mark_s, Jun 20, 2007 IP
  9. Eran-s

    Eran-s Peon

    Messages:
    50
    Likes Received:
    1
    Best Answers:
    0
    Trophy Points:
    0
    #9
    To display the data, make a select query and print the mysql_fetch_array content...
     
    Eran-s, Jun 20, 2007 IP
  10. UnrealEd

    UnrealEd Peon

    #10
    To display the ranking, just use a mysql_query and select the data according to the player's name:
    $query = "SELECT ranking FROM atp_rankings WHERE player='" . $player . "'";
    $result = mysql_query($query) or die("Database Error: " . mysql_error());
    while ($row = mysql_fetch_assoc($result)) {
      echo $row['ranking'];
    }
    PHP:
    This script already updates on Mondays only. I don't think it's wise to only update at 6 AM, because then the rankings will only be updated if a visitor happens to be there between 6:00 and 6:59 AM; otherwise the data isn't updated.

    I just thought of something else: suppose no one visits your website for 9 days, skipping a Monday. This means that your data will be 2 weeks old before you update. That's why it's best to add another check: see if the difference in days between the current date and the date in the database is larger than 7. If so, you need to update anyway. I think this should do the trick:
    $player = "Andy Murray";
    
    $query = "SELECT UNIX_TIMESTAMP(date) AS date FROM atp_rankings WHERE player='" . $player . "'";
    $result = mysql_query($query) or die("Database Error: " . mysql_error());
    $db_date = mysql_result($result, 0, "date"); // get the date from the resultset
    $day = date("N");
    
    if ($day == 1 || (date("z") - date("z", $db_date)) > 7) { // 1 == Monday, check out www.php.net
      if (date("Y-m-d") > date("Y-m-d", $db_date)) {
        // get the ranking here
        $query = "UPDATE atp_rankings SET ranking=" . $ranking . ", date=NOW() WHERE player='" . $player . "'";
        mysql_query($query) or die("Database Error: " . mysql_error());
      }
    }
    PHP:
    I'm not sure, but the additional condition might cause some problems when the year changes, i'd have to check, but now i need to get some sleep :)
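    The year-change worry comes from date("z"), which is the day of the year and starts again from 0 every January, so the subtraction can go negative. A sketch that sidesteps this by comparing Unix timestamps instead is shown below. The function name `shouldUpdate` is made up for illustration, and the use of gmdate() (UTC) rather than the server's local time zone is an assumption.

```php
<?php
/**
 * Same two conditions as the snippet above, but safe across a year boundary.
 * (Hypothetical helper; works in UTC, which is an assumption.)
 *
 * @param int $dbTimestamp Unix timestamp stored with the ranking
 * @param int $now         Current Unix timestamp, e.g. time()
 * @return bool True when the ranking should be scraped again
 */
function shouldUpdate($dbTimestamp, $now)
{
    // Age in whole days, computed from seconds so the year never matters.
    $ageInDays = (int) floor(($now - $dbTimestamp) / 86400);
    if ($ageInDays > 7) {
        return true; // a Monday was skipped, update regardless of the weekday
    }

    // Otherwise only update on a Monday, and only once that day.
    $isMonday = gmdate('N', $now) === '1';
    $storedBeforeToday = gmdate('Y-m-d', $dbTimestamp) < gmdate('Y-m-d', $now);

    return $isMonday && $storedBeforeToday;
}
```

    With data stored on 27 December and a visit on 5 January, this reports that an update is needed, whereas the day-of-year subtraction would give a negative number and skip it.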
     
    UnrealEd, Jun 20, 2007 IP
  11. mark_s

    mark_s Peon

    #11
    Thanks for the code... just have a few things...

    1) You say it checks on Monday... does that mean it will constantly pull the data from the external site on every hit throughout that day? Or only once?

    2) Can I have the code that connects people to the database?

    3) How is that code going to work when the actual PHP scrape is nowhere in it?
     
    mark_s, Jun 20, 2007 IP
  12. UnrealEd

    UnrealEd Peon

    #12
    No: within the if that checks whether today is Monday, there's another one, which checks if the date in the database is older than today's date:
    if (date("Y-m-d") > date("Y-m-d", $db_date)) {
    PHP:
    So the script will only update once on Monday, as I update the date in the database with the current date whenever that if condition is true.

    $con = mysql_connect($host, $username, $password) or die("Could not connect to the database: " . mysql_error());
    mysql_select_db($database, $con) or die("Could Not Find Database");
    PHP:
    Forgot all about that part :rolleyes:
    What you need to do is see if there's a player named $player in the database; if not, you need to grab the data anyway. You will have to INSERT the data into the database instead of UPDATE-ing it. It's really not that difficult to write. The only extra function you'll need is mysql_num_rows, and you will have to add another if.
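    A rough sketch of that INSERT-or-UPDATE step follows. Note that it swaps in the mysqli extension with prepared statements instead of the mysql_* functions used elsewhere in this thread, and counts rows with COUNT(*) rather than mysql_num_rows. The helper names `chooseSaveQuery` and `saveRanking` are hypothetical, and saveRanking assumes an already open mysqli connection.

```php
<?php
/**
 * Pick the statement to run, depending on whether the player already has a row.
 * Both statements take the same parameters in the same order: ranking, player.
 * (Hypothetical helper.)
 */
function chooseSaveQuery($playerExists)
{
    return $playerExists
        ? "UPDATE atp_rankings SET ranking = ?, date = NOW() WHERE player = ?"
        : "INSERT INTO atp_rankings (ranking, player, date) VALUES (?, ?, NOW())";
}

/**
 * Store a freshly scraped ranking. Needs the mysqli extension and an open
 * connection, so it is only defined here, not called. (Hypothetical helper.)
 */
function saveRanking(mysqli $db, $player, $ranking)
{
    // Count matching rows to find out whether the player is already stored.
    $check = $db->prepare("SELECT COUNT(*) FROM atp_rankings WHERE player = ?");
    $check->bind_param('s', $player);
    $check->execute();
    $check->bind_result($count);
    $check->fetch();
    $check->close();

    // Run either the UPDATE or the INSERT with the same two parameters.
    $save = $db->prepare(chooseSaveQuery($count > 0));
    $save->bind_param('is', $ranking, $player);
    $ok = $save->execute();
    $save->close();

    return $ok;
}
```

    The prepared statements also mean the player name never has to be concatenated into the SQL string.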
     
    UnrealEd, Jun 21, 2007 IP
  13. krakjoe

    krakjoe Well-Known Member

    Messages:
    1,795
    Likes Received:
    141
    Best Answers:
    0
    Trophy Points:
    135
    #13
    IMO using MySQL in this case is over the top; caching the file on disk should be enough, and regex matching takes hardly any time at all ....

    
    <?php
    /**
    * The number of days you wanna keep the cache on disk before retrieving a new copy from CACHE_URL
    **/ define( "CACHE_DAYS",			1 ) ;
    
    /**
    * The location on your server where you wanna keep the cache AND its filename ( putting a . before it will keep it hidden )
    **/ define( "CACHE_LOCATION",		'.cache.disk' ) ;
    
    /**
    * If the file you're getting is particularly large, you might wanna squash it with gzcompression
    **/ define( "CACHE_GZ",				0 ) ;
    
    /**
    * The url to cache
    **/ define( "CACHE_URL",			'http://www.atptennis.com/3/en/players/playerprofiles/?playernumber=MC10' );
    
    /**
    * If you change ANY settings, you MUST set this to one to create a new valid cache
    **/ define( "CACHE_FORCE",			0 ) ;
    
    class cache
    {
    	function cache( )
    	{
    		if( !file_exists( CACHE_LOCATION ) and !fopen( CACHE_LOCATION, 'w+' ) )
    			die( sprintf( "Make sure the server has permission to write to '%s'", CACHE_LOCATION ) );			
    		elseif( !$this->isvalid( ) or CACHE_FORCE )
    			$this->retrieve( );
    		
    		if( !( $this->disk = file_get_contents( CACHE_LOCATION ) ) )
    			die( sprintf( "Cannot read cache from '%s' into memory", CACHE_LOCATION ) );	
    
    	}
    	function time( )
    	{
    		return time( ) - ( CACHE_DAYS * 3600 * 24 );
    	}
    	function get( )
    	{
    		return CACHE_GZ ? gzuncompress( $this->disk ) : $this->disk ;
    	}
    	function isvalid( )
    	{
    		return ( filemtime( CACHE_LOCATION ) > $this->time( ) );
    	}
    	function retrieve( $size = 4096 )
    	{
    		if( !( $read = fopen( CACHE_URL, 'r' ) ) )
    		{
    			die( sprintf( "Cannot open '%s' for reading", CACHE_URL ) );	
    		}
    		elseif( !( $write = fopen( CACHE_LOCATION, 'w+' ) ) )
    		{
    			die( sprintf( "Cannot open '%s' for writing", CACHE_LOCATION ) );	
    		}
    		else
    		{
    			while( !feof( $read ) )
    			{
    				$buffer[ ] = fgets( $read, $size );
    			}
    			if( !fwrite( $write, CACHE_GZ ? gzcompress( implode( null, $buffer ) ) : implode( null, $buffer ) ) )
    			{
    				die( sprintf( "Cannot write to '%s'", CACHE_LOCATION ) );	
    			}
    			fclose( $read );
    			fclose( $write );
    			return true ;
    		}
    	}
    }
    /**
    * An example of how you might implement such a cache
    **/
    class scrape extends cache
    {
    	/**
    	* Construct scraper
    	**/
    	function scrape( )
    	{
    		/**
    		* Construct the cache
    		**/
    		parent::cache(  );
    	}
    	/**
    	* Operate on cache
    	**/
    	function operation( $regex, $index = null )
    	{
    		preg_match( $regex, parent::get( ), $matches );
    		return $index ? $matches[ $index ] : $matches;
    	}
    }
    /**
    * Create new scraper
    **/
    $scrape = new scrape( );
    /**
    * Operate on cache
    **/
    echo $scrape->operation( '/<div class="data">Current ATP Ranking - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/', 1 );
    ?>
    
    PHP:
     
    krakjoe, Jun 21, 2007 IP
  14. mark_s

    mark_s Peon

    #14
    Thanks guys...

    This is all a bit out of my understanding here.

    krakjoe, thanks for your input here... I get this message: CACHE Cannot read cache from 'rank_cache.disk' into memory

    Also, is there a way to get the cache cleared every Monday 6AM GMT?
     
    mark_s, Jun 21, 2007 IP
  15. krakjoe

    krakjoe Well-Known Member

    #15
    on first run set CACHE_FORCE to 1

    create the cache on a monday at 6 am and set CACHE_DAYS to 7
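    An alternative to timing the first run by hand is to compare the cache file's modification time against the most recent Monday 06:00 GMT. The sketch below assumes the rankings really are published by that time, as mark_s describes; the function names `lastRankingRelease` and `cacheIsCurrent` are made up for illustration.

```php
<?php
/**
 * Unix timestamp of the most recent Monday 06:00 GMT at or before $now.
 * (Hypothetical helper; the Monday 06:00 release time is an assumption.)
 */
function lastRankingRelease($now)
{
    // ISO day of the week in UTC: 1 (Monday) through 7 (Sunday).
    $dayOfWeek = (int) gmdate('N', $now);

    // Midnight UTC at the start of the current day.
    $midnight = $now - ($now % 86400);

    // 06:00 UTC on the Monday of the current week.
    $release = $midnight - ($dayOfWeek - 1) * 86400 + 6 * 3600;

    // Early on Monday, before 06:00, the latest release is last week's.
    if ($release > $now) {
        $release -= 7 * 86400;
    }

    return $release;
}

/**
 * True when a cache written at $cacheTime is newer than the latest release.
 */
function cacheIsCurrent($cacheTime, $now)
{
    return $cacheTime >= lastRankingRelease($now);
}
```

    In the class above, isvalid() could then return cacheIsCurrent( filemtime( CACHE_LOCATION ), time( ) ) instead of relying on CACHE_DAYS.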
     
    krakjoe, Jun 21, 2007 IP
  16. mark_s

    mark_s Peon

    #16
    Thanks a lot, fantastic!

    The scrape code you used is for the 'Entry' ranking, but I also want to show the 'Race' ranking. The code is almost identical apart from one word. So what would you suggest as the best way to implement my other scrape with your cache?

    <?php
    $data = file_get_contents('http://www.atptennis.com/3/en/players/playerprofiles/?playernumber=MC10');
    $regex = '/<div class="data">Current ATP Race - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/';
    preg_match($regex,$data,$race);
    echo $race[1];
    ?>
    Code (markup):
     
    mark_s, Jun 21, 2007 IP
  17. UnrealEd

    UnrealEd Peon

    #17
    just use the same method, but with a different regex to get the value of the ranking:
    echo $scrape->operation( '/<div class="data">Current ATP Race - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/', 1 );
    PHP:
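    Since the two patterns differ by a single word, one small helper can serve both. This is a sketch only: the function name `extractAtpNumber` is hypothetical, and it works on a plain HTML string such as the one returned by the cache's get() method.

```php
<?php
/**
 * Pull one "Current ATP <label> - Singles" number out of the profile page.
 * (Hypothetical helper; the pattern is the one posted earlier in the thread.)
 *
 * @param string $html  The cached page contents
 * @param string $label The word that differs, e.g. 'Ranking' or 'Race'
 * @return int|null The number, or null when the pattern does not match
 */
function extractAtpNumber($html, $label)
{
    // preg_quote() keeps the label from being read as regex syntax.
    $regex = '/<div class="data">Current ATP ' . preg_quote($label, '/')
           . ' - Singles:<\/div>\s*<\/td>\s*<td valign="middle">'
           . '<span class="lines">\s*(\d+)\s*<\/span>/';

    return preg_match($regex, $html, $matches) ? (int) $matches[1] : null;
}
```

    Both values then come from the same cached copy, for example extractAtpNumber($scrape->get(), 'Ranking') and extractAtpNumber($scrape->get(), 'Race'), so the remote page is still fetched only once.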
     
    UnrealEd, Jun 21, 2007 IP
  18. mark_s

    mark_s Peon

    #18
    Thank you UnrealEd and krakjoe for your amazing help. I really appreciate it.

    I'll test whether this cache works come this Monday :D
     
    mark_s, Jun 21, 2007 IP
  19. davidseq2007

    davidseq2007 Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #19
    If those guys change their layout, your regexp won't work and neither will your site.
    How about finding another site that supplies the data you need via RSS?

     
    davidseq2007, Jun 21, 2007 IP
  20. mark_s

    mark_s Peon

    #20
    No I don't think there is an RSS available for the ranking.

    Is there no code that can be given so that if the ATP site changes it doesn't affect my site loading? Like a timeout function or something?
     
    mark_s, Jun 21, 2007 IP
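    One possible answer to the timeout question, as a sketch rather than tested production code: give the HTTP request a timeout through a stream context, and fall back to the last known ranking whenever the fetch fails or the pattern no longer matches. The function names and the 5-second default are assumptions.

```php
<?php
/**
 * Fetch a page, giving up after $timeoutSeconds instead of hanging.
 * Returns the HTML, or null on any failure. (Hypothetical helper.)
 */
function fetchPage($url, $timeoutSeconds = 5)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));
    // @ suppresses the warning; failure is reported through the return value.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

/**
 * Extract the singles ranking, or null if the page is missing or has changed.
 */
function parseRanking($html)
{
    $regex = '/<div class="data">Current ATP Ranking - Singles:<\/div>\s*<\/td>\s*'
           . '<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/';
    if ($html !== null && preg_match($regex, $html, $matches)) {
        return (int) $matches[1];
    }

    return null;
}

/**
 * Prefer a freshly parsed ranking; otherwise keep showing the last known one.
 */
function currentRanking($html, $lastKnown)
{
    $fresh = parseRanking($html);

    return $fresh !== null ? $fresh : $lastKnown;
}
```

    A call such as currentRanking(fetchPage($url), $storedRanking) would then keep the page loading with the old number even if the remote site is slow, down, or redesigned.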