I have this kind of situation. I have a MySQL table called ToKeepDB which contains some IDs. Furthermore, I have a folder filled with photos whose names are in the format *ID* (meaning anything + ID + anything). On top of that, there may be more than one photo containing the same ID (example: 2247_1.jpg, 2247_2.jpg, old_2247_1.jpg, etc.). Now I want to delete all the photos that don't have their ID in ToKeepDB, so I wrote the following code:

PHP:
    $dirhandle = opendir($path);
    while ($filename = readdir($dirhandle)) {
        if (!is_dir("$path/$filename")) {
            $exist = 0;
            $indbquery = mysql_query("SELECT `ID` from `ToKeepDB`");
            while ($checkline = mysql_fetch_array($indbquery)) {
                $checkname = $checkline['ID'];
                if (preg_match("^$checkname^", $filename)) {
                    $exist = 1;
                    break;
                }
            }
            if ($exist == 0) {
                $execute = "rm -f $path/$filename";
                `$execute`;
                echo "$path/$filename has been deleted<br>\n";
            }
        }
    }

The logic seems to work just fine. However, there is a big problem. The folder contains some 100k photos and the database contains some 20-30k IDs, so the number of comparisons this script makes is around 100,000 x 20,000, which is huge .. well, at least huge enough to bother the server. As a result I'm getting some 10-15 files deleted from the folder, and after that my account is automatically suspended for exceeding the CPU quota of the server. Can someone please suggest a more efficient way to do this? Just keep in mind that I'm talking about a large number of files and a large number of rows in the DB. Thanks
Get all the IDs from the database with one query and put them into an array. Then use scandir() (if using PHP 5, otherwise build the list manually) and put all the files into an array. Then use those two arrays to remove the files.
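A rough sketch of what that could look like, assuming the same mysql_* connection and $path the original script already uses. strpos() is swapped in for preg_match() and unlink() for the shell rm; both are cheaper and avoid regex-delimiter and shell-quoting surprises:

PHP:
    // one query for all IDs, instead of one query per file
    $ids = array();
    $result = mysql_query("SELECT `ID` FROM `ToKeepDB`");
    while ($row = mysql_fetch_array($result)) {
        $ids[] = $row['ID'];
    }

    // one directory read for all filenames (scandir needs PHP 5)
    $files = scandir($path);

    foreach ($files as $filename) {
        if (is_dir("$path/$filename")) {
            continue;
        }
        $keep = false;
        foreach ($ids as $id) {
            if (strpos($filename, $id) !== false) {   // plain substring check
                $keep = true;
                break;
            }
        }
        if (!$keep) {
            unlink("$path/$filename");
            echo "$path/$filename has been deleted<br>\n";
        }
    }

This still compares every file against every ID in the worst case, but it hits the database exactly once and replaces a compiled regex per pair with a cheap substring check.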
What you have just said was my first version of the script. I made two arrays, one containing all the IDs from the DB and the other containing all the filenames from the folder, and then I started comparing them. I was getting the same result, so I changed the script to the one shown above in the hope that dropping the two large arrays from memory would help .. but it obviously didn't. The only difference between my original script and what you've just said is the scandir thing, which I don't know and didn't use .. and my best guess is my server doesn't run PHP 5.
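For reference, scandir() (PHP 5 and later) simply returns every entry of a directory as an array in one call; on PHP 4 the same array can be built by hand with the readdir() loop you already have. A minimal sketch:

PHP:
    // PHP 5: one call, then drop the '.' and '..' entries
    $files = array_diff(scandir($path), array('.', '..'));

    // PHP 4 fallback: build the same array manually
    $files = array();
    $dirhandle = opendir($path);
    while (($filename = readdir($dirhandle)) !== false) {
        if ($filename != '.' && $filename != '..') {
            $files[] = $filename;
        }
    }
    closedir($dirhandle);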
First of all, optimisation aside, there is a high chance of the script treating files incorrectly. You are matching any file whose name merely contains an ID from the database as a substring, so if you have ID 1 in the database, all of these files count as matches: 1.jpg, 10.jpg, 11.jpg, 21.jpg, 310.jpg, 22_1.jpg, .... I think this way is better (a rough sketch follows below):

Step 1:
1. loop through the directory
2. for each file, extract the ID out of the file name
3. store the ID in an array (let's name this array $idsFromFile)
end loop
4. run array_unique() on that array of IDs to remove duplicates.

Step 2:
5. query "SELECT `ID` from `ToKeepDB`"
6. store all of these IDs in another array (let's name this array $idsFromDb)

Step 3:
7. loop through $idsFromFile
8. for each ID, check if it's in $idsFromDb => http://php.net/in_array
9. if not, remove the file => http://php.net/unlink
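A sketch of those three steps. The regex /\d{4,}/ below is only a placeholder for illustration -- it has to be replaced with whatever actually describes your IDs. One extra trick: if the IDs are used as array keys, duplicates collapse on their own (so array_unique() is not even needed) and isset() replaces in_array(), which is dramatically faster on a 20-30k element array:

PHP:
    // Step 1: collect the ID of every file, grouped by ID
    $idsFromFile = array();
    foreach (scandir($path) as $filename) {
        if (is_dir("$path/$filename")) continue;
        // hypothetical pattern -- replace with the real definition of an ID
        if (preg_match('/\d{4,}/', $filename, $m)) {
            $idsFromFile[$m[0]][] = $filename;   // remember which files carry this ID
        }
    }

    // Step 2: one query; the IDs become array keys for cheap lookups
    $idsFromDb = array();
    $result = mysql_query("SELECT `ID` FROM `ToKeepDB`");
    while ($row = mysql_fetch_array($result)) {
        $idsFromDb[$row['ID']] = true;
    }

    // Step 3: delete every file whose ID is not in the database
    foreach ($idsFromFile as $id => $filenames) {
        if (!isset($idsFromDb[$id])) {
            foreach ($filenames as $filename) {
                unlink("$path/$filename");
            }
        }
    }

With this structure the work is roughly one regex match per file plus one isset() per distinct ID, instead of 100,000 x 20,000 comparisons.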
Well yes, from the example I wrote above there seems to be a good chance of deleting innocent files .. but the reality is a bit different, and the IDs are big enough that they won't collide with other files. An example of a real filename that I have is FG362347GAN_1.jpg, FG362347GAN_2.jpg, etc., or 1_GA21346.jpg, 2_GA21346.jpg, etc.

In step 1, point 2, where you say to extract the ID out of the filename .. I wish I could do that .. that would solve my problem .. but it is not possible, because I have no formal definition of the ID (such as 12 digits + 3 characters). The ID is simply not machine-catchable .. it may be anything from a collection of 15 digits and characters down to just 7 digits, etc.

As for the rest .. yes, that's what I'm doing .. getting both things into arrays and then starting the check process. But the thing is that I end up with two arrays, currently 300k and 24k entries, so the number of operations required is 24,000 x 300,000, which is huge and crashes the PHP engine.

What I did now is put those two lists into separate SQL tables and then started comparing the tables, putting a 1 in front of the filenames that are to be kept and leaving the default 0 in front of the filenames that are to be deleted. (After that I'm just planning to run a process that will delete all the files that have a 0 against their filename in the SQL table.) Well .. this may be a stupid way to do things, but at least it didn't crash my engine and I didn't go over the CPU quota. I've been running the script for 5 hours already, and so far it has found some 100k (out of the 300k) images that are to be kept .. I'm just waiting for the process to finish so I can then go through the deletion logic.

What I learned from this is that when things get bigger, there are principal differences from what you knew before. Still, this is not a one-time process; I have to set this up as a daily cron job. So if a good, experienced programmer can suggest something that works better with this kind of huge amount of data, I would be very thankful.
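For what it's worth, that flagging step does not have to be driven row by row from PHP: MySQL can do the whole comparison in one statement, so the looping stays inside the database engine instead of going through a query round-trip per filename. A sketch, assuming a hypothetical `photo_files` table with `filename` and `keep` columns alongside the existing `ToKeepDB`:

PHP:
    // flag every filename that contains an ID we want to keep -- one statement,
    // the n x m comparison happens inside MySQL instead of in PHP
    mysql_query("
        UPDATE `photo_files` f
        JOIN `ToKeepDB` k ON f.`filename` LIKE CONCAT('%', k.`ID`, '%')
        SET f.`keep` = 1
    ");

It is still a full cross comparison (a leading-wildcard LIKE cannot use an index), so it won't be instant on 300k x 24k rows, but it avoids hours of per-row queries from PHP. One caveat: if an ID could ever contain % or _, those would need escaping, since they are LIKE wildcards.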
From what you said, you seem to be misunderstanding what an 'array' is. The reason why Galen and I suggested storing the values in PHP arrays and working on them, instead of querying the database many times, is to save resources on:
- querying the database (sending the query, processing it, returning the result, processing the result)
- keeping the database connection open for a long time
- disk access (when you load the data into PHP arrays you are working in memory, which is much faster than disk)

If the ID really isn't machine-catchable, how do you do it when you need to, say, pull out the images for ID x? I imagine you need to do this quite often, don't you? I would instead look into correcting the database and/or script design and avoid the need to do this in the first place.
phper - Using two arrays and then comparing them was my first version of the script. I understand all the advantages of doing that which you wrote about. However, because those arrays were too big I kept getting "Over CPU Quota" and my account suspended on my shared hosting .. the server simply wasn't letting me do that with arrays that big. I understand that you are right about the arrays, but it simply didn't work on my server and I could do nothing about it.

Regarding pulling out the photos .. see, I'm getting those photos from different sources and they are formatted differently. The only common thing they have is that they always include the ID in the filename. So for example one source gives me ASD523476_1.jpg (where ASD523476 is the ID), another source gives me 1-245722340.jpg (where 245722340 is the ID), and a third source gives me something else .. but it still has the ID in it. Those IDs are guaranteed to be unique, so I'm not worried about that.

I can't avoid the need to do this every day, because the situation is this: every day I download a zip file containing photos (say 100k of them) and a CSV file containing IDs. Next, I unzip the photos into the folder and load the CSV into SQL. The CSV contains the IDs of all the photos that I need to keep (so on the first day it exactly matches the photos). But the next day, when I do the same routine, I get some more photos (say 20k of them) and a CSV file telling me which ones to keep. Now, on that second day, the CSV file may tell me that some photos from yesterday are outdated and I don't need them .. so I have to delete them to keep things clean. So day after day I get these updates where I have to delete the outdated photos and keep (and add) only those that are in that day's CSV file .. so as you can see, this is an everyday routine.
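If the flag-and-delete approach stays in place for the daily cron job, the final pass is just one SELECT over the flagged table plus an unlink() per doomed file. A sketch, again assuming the hypothetical `photo_files` table from above:

PHP:
    // daily cleanup: remove every file that was not flagged as a keeper
    $result = mysql_query("SELECT `filename` FROM `photo_files` WHERE `keep` = 0");
    while ($row = mysql_fetch_array($result)) {
        $file = "$path/" . $row['filename'];
        if (is_file($file)) {
            unlink($file);
        }
    }

    // start fresh for the next day's zip + CSV import
    mysql_query("TRUNCATE TABLE `photo_files`");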