I have some PHP scraping code on my website that grabs the latest ranking of a tennis player, but about half the time it makes my site load 3x slower. Is there a way I can avoid this lag? Do I need some advanced cache code, or should I use a database to resolve this issue? The code I'm using:

    <?php
    $data = file_get_contents('http://www.atptennis.com/3/en/players/playerprofiles/?playernumber=MC10');
    $regex = '/<div class="data">Current ATP Ranking - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/';
    preg_match($regex, $data, $entry);
    echo $entry[1];
    ?>
You should seriously consider caching this. Why? Every time you load the page you are hitting the atptennis.com server. Their webmaster may get annoyed, especially if you have a lot of visitors. You can try to set up a cron job that runs this code once or twice a day, stores the result in a local database, and then display it from there.
You don't even need a cronjob. I assume this site isn't updated every hour, so you can easily store a timestamp in your database (together with the wanted content), and then check in your script if a certain amount of time (let's say a day) has been exceeded, and then grab the contents again from the webpage.
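The timestamp check described above can be sketched as a small helper. This is only an illustration: the function name `isStale` and the one-day default are assumptions, and `$lastUpdated` is taken to be a Unix timestamp read from wherever the value is stored.

```php
<?php
// Hypothetical helper (not from the thread): true when the stored value
// is at least $maxAgeSeconds old and should be fetched again.
function isStale(int $lastUpdated, int $maxAgeSeconds = 86400, ?int $now = null): bool
{
    $now = $now ?? time(); // allow a fixed "now" so the check is easy to test
    return ($now - $lastUpdated) >= $maxAgeSeconds;
}
```

A page would call `isStale($storedTimestamp)` on each request and only hit the remote site when it returns true.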
Is there any chance you could give me the code for such a function? Or if that will take too much of your time, maybe some sort of tutorial? In fact the way the php scrape could work is for it to only grab the data at a certain time and day in the week. The rankings are updated every Monday morning. Thanks for the info, much appreciated.
Sure, no problem. Is the ranking the only thing you want to grab from the website, or do you want some other stuff as well? It matters because you can speed things up if you only need the ranking and a few other things. For now I'm going to assume you only need the ranking.

First, create a new table in your database named atp_rankings (or something similar). It should have at least 3 fields: one for the name of the player, one for the last-updated timestamp, and one for the ranking of the player:

    CREATE TABLE `atp_rankings` (
        `player` VARCHAR(255) NOT NULL,
        `date` TIMESTAMP NOT NULL default CURRENT_TIMSTAMP,
        `ranking` INT(4) NOT NULL
    );

Whenever a visitor wants to see the ranking of a certain player, you have to check when you last updated that player's ranking. Since the site is updated every Monday, you first check in PHP whether the current day is Monday, and then whether the date in the database differs from today's date (sounds crazy, I'll post some code a little further down). If the date is different, you have old data in your database and need to fetch the webpage. After you have fetched the page, you use your regex to get the ranking, and that is what you store in your database. And now in code format:

    $player = "Andy Murray";
    $query = "SELECT UNIX_TIMESTAMP(date) AS date FROM atp_rankings WHERE player='" . $player . "'";
    $day = date("N");
    if ($day == 1) { // 1 == Monday, check out www.php.net
        $result = mysql_query($query) or die("Database Error: " . mysql_error());
        $db_date = mysql_result($result, 0, "date"); // get the date from the result set
        if (date("Y-m-d") > date("Y-m-d", $db_date)) {
            // get the ranking here
            $query = "UPDATE atp_rankings SET ranking=" . $ranking . ", date=NOW() WHERE player='" . $player . "'";
            mysql_query($query) or die("Database Error: " . mysql_error());
        }
    }
    // now grab the content from the database again to display it to the user

I hope this helps you a little further. I didn't test the code and wrote everything on the spot, so there will probably be some errors.

If you want to display more information from the webpage, you just have to create some additional fields in the table. If it's a lot, it may be best to store the <div> container with all the information in one field, and use a regex to pull out what you need each time you display it to the visitor.
Wow! Thank you so much for all the code. I get this error in phpMyAdmin when creating the table:

    #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'CURRENT_TIMSTAMP, `ranking` INT(4) NOT NULL )' at line 3
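The text MySQL quotes after "near" starts at `CURRENT_TIMSTAMP`, which is a misspelling of the keyword `CURRENT_TIMESTAMP`. A corrected statement, held in a PHP string so it can be passed to whichever query call is in use, might look like the sketch below; the function name `atpRankingsTableSql` is made up for illustration.

```php
<?php
// Hypothetical helper: returns the CREATE TABLE statement from the thread
// with the default keyword spelled correctly (CURRENT_TIMESTAMP).
function atpRankingsTableSql(): string
{
    return <<<SQL
CREATE TABLE `atp_rankings` (
    `player`  VARCHAR(255) NOT NULL,
    `date`    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    `ranking` INT(4) NOT NULL
)
SQL;
}
```

The same statement can also be pasted directly into phpMyAdmin without the PHP wrapper.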
Thanks. So I've created the table, put the PHP into rankingscrape.php, and included that file on my website. How do I now make it display the ranking? And how do I make it so it only checks ATP.com on a Monday at 6AM GMT?
To display the ranking, just use a mysql_query and select the data by the player's name:

    $query = "SELECT ranking FROM atp_rankings WHERE player='" . $player . "'";
    $result = mysql_query($query) or die("Database Error: " . mysql_error());
    while ($row = mysql_fetch_assoc($result)) {
        echo $row['ranking'];
    }

This script already updates on Mondays only. I don't think it's wise to update only at 6AM, because then the rankings will only be refreshed if a visitor shows up between 6:00 and 6:59 AM; otherwise the data isn't updated.

I just thought of something else: suppose no one visits your website for 9 days, skipping a Monday. Your data could then be 2 weeks old before it is updated. That's why it's best to add another check: see whether the difference in days between the current date and the date in the database is larger than 7. If so, you need to update anyway. I think this should do the trick:

    $player = "Andy Murray";
    $query = "SELECT UNIX_TIMESTAMP(date) AS date FROM atp_rankings WHERE player='" . $player . "'";
    $result = mysql_query($query) or die("Database Error: " . mysql_error());
    $db_date = mysql_result($result, 0, "date"); // get the date from the result set
    $day = date("N");
    if ($day == 1 || (date("z") - date("z", $db_date)) > 7) { // 1 == Monday, check out www.php.net
        if (date("Y-m-d") > date("Y-m-d", $db_date)) {
            // get the ranking here
            $query = "UPDATE atp_rankings SET ranking=" . $ranking . ", date=NOW() WHERE player='" . $player . "'";
            mysql_query($query) or die("Database Error: " . mysql_error());
        }
    }

I'm not sure, but the additional condition might cause problems when the year changes. I'd have to check, but right now I need to get some sleep.
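One way to sidestep both the day-of-year subtraction and the "visitor must arrive at 6AM" problem is to compare the stored timestamp against the most recent Monday 06:00 GMT. The sketch below assumes the rankings really do appear at that time (the thread only says "Monday morning"), and the function names are invented for illustration.

```php
<?php
// Hypothetical helper: Unix timestamp of the most recent Monday 06:00 GMT
// that is not later than $now. Plain integer arithmetic on GMT, so a
// change of year needs no special handling.
function lastRankingRelease(int $now): int
{
    $weekday  = (int) gmdate('N', $now);   // 1 = Monday ... 7 = Sunday
    $midnight = $now - ($now % 86400);     // 00:00 GMT of the current day
    $release  = $midnight - ($weekday - 1) * 86400 + 6 * 3600;
    if ($release > $now) {
        // It is Monday but still before 06:00, so use last week's release.
        $release -= 7 * 86400;
    }
    return $release;
}

// True when the stored data predates the latest assumed release.
function needsRefresh(int $lastUpdated, int $now): bool
{
    return $lastUpdated < lastRankingRelease($now);
}
```

With this check the first visitor after Monday 06:00 GMT triggers the refresh, whenever that visit happens, and data older than a week is always refreshed.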
Thanks for the code... I just have a few questions: 1) You say it checks on Monday. Does that mean it will pull the data from the external site on every hit throughout that day, or only once? 2) Can I have the code that connects to the database? 3) How is that code going to work when the actual PHP scrape is nowhere in there?
No. Inside the if that checks whether today is Monday there's another one, which checks whether the date in the database is older than today's date:

    if (date("Y-m-d") > date("Y-m-d", $db_date)) {

So the script will only update once on Monday, because I set the date in the database to the current date whenever that condition is true.

To connect to the database:

    $con = mysql_connect($host, $username, $password) or die("Could not connect to the database: " . mysql_error());
    mysql_select_db($database, $con) or die("Could Not Find Database");

Forgot all about that part. What you need to do is check whether there's a player named $player in the database; if not, you need to grab the data anyway, and you will have to INSERT it into the database instead of UPDATE-ing it. It's really not that difficult to write. The only extra function you'll need is mysql_num_rows, and you will have to add another if.
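The insert-or-update step described above could look roughly like this. It is a sketch under assumptions: it uses PDO with prepared statements rather than the `mysql_*` functions from the thread (those were removed in PHP 7), the helper name `saveRanking` is made up, and the table is assumed to be the `atp_rankings` table defined earlier.

```php
<?php
// Hypothetical helper: store a ranking, inserting the player's row when it
// does not exist yet and updating it otherwise.
// Assumes $pdo is an open PDO connection to the database holding atp_rankings.
function saveRanking(PDO $pdo, string $player, int $ranking): void
{
    $check = $pdo->prepare('SELECT COUNT(*) FROM atp_rankings WHERE player = ?');
    $check->execute([$player]);

    if ((int) $check->fetchColumn() === 0) {
        // No row for this player yet: INSERT
        $sql    = 'INSERT INTO atp_rankings (player, ranking, date) VALUES (?, ?, CURRENT_TIMESTAMP)';
        $params = [$player, $ranking];
    } else {
        // Row exists: UPDATE the ranking and the timestamp
        $sql    = 'UPDATE atp_rankings SET ranking = ?, date = CURRENT_TIMESTAMP WHERE player = ?';
        $params = [$ranking, $player];
    }
    $pdo->prepare($sql)->execute($params);
}
```

Placeholders also avoid building the query by concatenating `$player` into the SQL string, which matters as soon as the name comes from user input.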
IMO using MySQL in this case is over the top. Caching the file on disk should be enough; regex matching takes hardly any time at all.

    <?php
    /**
     * The number of days you wanna keep the cache on disk before retrieving a new copy from CACHE_URL
     **/
    define( "CACHE_DAYS", 1 );
    /**
     * The location on your server where you wanna keep the cache AND its filename ( putting a . before it will keep it hidden )
     **/
    define( "CACHE_LOCATION", '.cache.disk' );
    /**
     * If the file you're getting is particularly large, you might wanna squash it with gzcompression
     **/
    define( "CACHE_GZ", 0 );
    /**
     * The url to cache
     **/
    define( "CACHE_URL", 'http://www.atptennis.com/3/en/players/playerprofiles/?playernumber=MC10' );
    /**
     * If you change ANY settings, you MUST set this to one to create a new valid cache
     **/
    define( "CACHE_FORCE", 0 );

    class cache
    {
        function __construct( )
        {
            if( !file_exists( CACHE_LOCATION ) and !fopen( CACHE_LOCATION, 'w+' ) )
                die( sprintf( "Make sure the server has permission to write to '%s'", CACHE_LOCATION ) );
            elseif( !$this->isvalid( ) or CACHE_FORCE )
                $this->retrieve( );
            if( !( $this->disk = file_get_contents( CACHE_LOCATION ) ) )
                die( sprintf( "Cannot read cache from '%s' into memory", CACHE_LOCATION ) );
        }
        function time( )
        {
            return time( ) - ( CACHE_DAYS * 3600 * 24 );
        }
        function get( )
        {
            return CACHE_GZ ? gzuncompress( $this->disk ) : $this->disk;
        }
        function isvalid( )
        {
            return ( filemtime( CACHE_LOCATION ) > $this->time( ) );
        }
        function retrieve( $size = 4096 )
        {
            if( !( $read = fopen( CACHE_URL, 'r' ) ) )
            {
                die( sprintf( "Cannot open '%s' for reading", CACHE_URL ) );
            }
            elseif( !( $write = fopen( CACHE_LOCATION, 'w+' ) ) )
            {
                die( sprintf( "Cannot open '%s' for writing", CACHE_LOCATION ) );
            }
            else
            {
                while( !feof( $read ) )
                {
                    $buffer[ ] = fgets( $read, $size );
                }
                if( !fwrite( $write, CACHE_GZ ? gzcompress( implode( null, $buffer ) ) : implode( null, $buffer ) ) )
                {
                    die( sprintf( "Cannot write to '%s'", CACHE_LOCATION ) );
                }
                fclose( $read );
                fclose( $write );
                return true;
            }
        }
    }
    /**
     * An example of how you might implement such a cache
     **/
    class scrape extends cache
    {
        /**
         * Construct scraper
         **/
        function __construct( )
        {
            /**
             * Construct the cache
             **/
            parent::__construct( );
        }
        /**
         * Operate on cache
         **/
        function operation( $regex, $index = null )
        {
            preg_match( $regex, parent::get( ), $matches );
            return $index ? $matches[ $index ] : $matches;
        }
    }
    /**
     * Create new scraper
     **/
    $scrape = new scrape( );
    /**
     * Operate on cache
     **/
    echo $scrape->operation( '/<div class="data">Current ATP Ranking - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/', 1 );
    ?>
Thanks guys... This is all a bit beyond my understanding. krakjoe, thanks for your input here. I get this message:

    CACHE Cannot read cache from 'rank_cache.disk' into memory

Also, is there a way to get the cache cleared every Monday 6AM GMT?
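One plausible cause, judging only from the constructor as posted: when the cache file does not exist yet, `fopen( CACHE_LOCATION, 'w+' )` creates it empty with a modification time of "now", so `isvalid( )` treats it as fresh, `retrieve( )` is skipped, and `file_get_contents( )` returns an empty string, which triggers the "Cannot read cache" message. A validity check that also rejects empty files might look like the sketch below; the function name and parameters are assumptions, not part of the posted class.

```php
<?php
// Hypothetical check: a cache file is usable only if it exists, is not
// empty, and was modified less than $maxAgeSeconds ago.
function cacheIsUsable(string $path, int $maxAgeSeconds, ?int $now = null): bool
{
    $now = $now ?? time();
    clearstatcache(true, $path); // avoid stale size/mtime from PHP's stat cache
    if (!is_file($path) || filesize($path) === 0) {
        return false;            // missing or empty: must be fetched
    }
    return ($now - filemtime($path)) < $maxAgeSeconds;
}
```

Deleting the empty cache file by hand, or setting `CACHE_FORCE` to 1 for a single request, should also force a first fetch with the class as posted.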
Thanks a lot, fantastic! The scrape code you used is for the 'Entry' ranking, but I also want to show the 'Race' ranking. The code is almost identical apart from one word. What would you suggest is the best way to use your cache for my other scrape?

    <?php
    $data = file_get_contents('http://www.atptennis.com/3/en/players/playerprofiles/?playernumber=MC10');
    $regex = '/<div class="data">Current ATP Race - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/';
    preg_match($regex, $data, $race);
    echo $race[1];
    ?>
Just use the same method, but with a different regex to get the value of the ranking:

    echo $scrape->operation( '/<div class="data">Current ATP Race - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/', 1 );
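Since the two patterns differ by a single word, that word can be passed in as a parameter. The helper below is a sketch with an invented name; it takes the page HTML as a string (for example the cached copy) and returns null when the pattern does not match.

```php
<?php
// Hypothetical helper: $label is "Ranking" or "Race", the one word that
// differs between the two regexes in the thread.
function extractAtpValue(string $html, string $label): ?int
{
    $regex = '/<div class="data">Current ATP ' . preg_quote($label, '/')
           . ' - Singles:<\/div>\s*<\/td>\s*<td valign="middle"><span class="lines">\s*(\d+)\s*<\/span>/';

    // preg_match returns 1 on a match, 0 on no match
    return preg_match($regex, $html, $m) === 1 ? (int) $m[1] : null;
}
```

Both values then come from one download of the page: `extractAtpValue($html, 'Ranking')` and `extractAtpValue($html, 'Race')`.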
Thank you UnrealEd and krakjoe for your amazing help. I really appreciate it. I'll test whether this cache works come this Monday
If those guys change their layout, your regexp won't work and neither will your site. How about finding another site that supplies the data you need via RSS?
No, I don't think there is an RSS feed available for the ranking. Is there no code that would keep my site loading normally if the ATP site changes or is slow? Like a timeout function or something?
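A timeout can be set through a stream context passed to `file_get_contents`, and the last stored value can be shown whenever the fetch or the regex fails. The sketch below uses invented function names and an arbitrary 3-second limit; `$lastKnown` stands for whatever value was saved previously.

```php
<?php
// Hypothetical helper: fetch $url but stop waiting after $timeoutSeconds.
// Returns null on any failure instead of emitting a warning.
function fetchWithTimeout(string $url, float $timeoutSeconds = 3.0): ?string
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds], // timeout in seconds for the HTTP wrapper
    ]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Prefer a freshly scraped value; fall back to the last stored one when the
// page could not be fetched or the regex no longer matches the layout.
function rankingOrFallback(?string $html, string $regex, ?int $lastKnown): ?int
{
    if ($html !== null && preg_match($regex, $html, $m) === 1) {
        return (int) $m[1];
    }
    return $lastKnown;
}
```

Usage would be along the lines of `rankingOrFallback(fetchWithTimeout($url), $regex, $storedRanking)`, so a slow or redesigned ATP page costs at most a few seconds and the visitor still sees the previous ranking.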