Ok the situation is like this: I have a table where I need to retrieve the latest records between a certain time, and also given the name(s) that need to match another column... Table structure for position update table: The query I perform is below. Note that when I run this against a table with 1 + million records, I haven't yet seen it complete Strangely when I hard code the inner select statement it is quite a bit faster. Although it can still take very large (unnatural) amount of time to complete: The explain statement of the slow SQL query is below (note: I ran this against a small table): "id","select_type" ,"table" ,"type" ,"possible_keys" ,"key" ,"key_len","ref" ,"rows","Extra" 1, "PRIMARY" ,"<derived2>" ,"system" ,"" ,"" ,"" ,"" ,1,"" 1, "PRIMARY" ,"tblPositionUpdates","const" ,"PRIMARY" ,"PRIMARY" ,"4" ,"const" ,1,"" 2, "DERIVED" ,"tblPositionUpdates","index" ,"Index_UTCTime" ,"Index_TargetName","52" ,"" ,11087,"Using where" 3, "DEPENDENT SUBQUERY","tblProjectTargets" ,"index_subquery","fkidTargetName,fkidProject","fkidTargetName" ,"52" ,"func" ,1 ,"Using index; Using where" Code (markup): I suspect it is the fact that I am using large indexes. Any help/suggestions would be greatly appreciated.
Have you looked at a Query Execution plan? This is often the best way to find bottlenecks - a WHERE IN will scan the left side top to bottom for each item in the right side - if both sides have numerous records this can be a nightmare.
Thanks for the suggestion. Do you know of any free software that will analyze my queries for performance issues such as the one I am having? Unfortunately both sides of the WHERE IN can have large set of values. In fact the left side is so large that it will take over 5 minutes to complete. I always imagined that the WHERE IN would make the query much faster as it would be able to use the indexes, and not have to search the whole table. However I don't think this is always the problem. For example if the inner select statement : (select fkidTargetName from tblProjectTargets where fkidProject = 9) is empty (lets say 9 is not a valid project). It still takes almost as long to execute as if it returned values.
That's one hell of a sql you got there. This is probably what you want: select * from tblPositionUpdates where idPositionUpdate in (select max(idPositionUpdate) as id from tblPositionUpdates where sgTargetName in (select fkidTargetName from tblProjectTargets where fkidProject = 9) and inUTCTime <= 853229662 group by sgTargetName) ) PHP:
Yeah that is just one of the slower queries I am using, in fact there are about 10 more. But if I can solve this one I'll be able to sort out most of the others too. I tried that query, doesn't appear to make a difference in terms of speed. (I killed the thread after it continued running for over 8 minutes). It seems that when MySQL begins copying into the tmp table it takes the largest proportion of time. Remember the tblPositionUpdate presently contains about 3 million records ... Is WHERE IN generally a better option than the INNER JOIN?
WHERE IN is more correct than using INNER JOIN, in this case. Even though the result might be the same, its confusing to figure out why you are using join. How many rows does this query return? (select fkidTargetName from tblProjectTargets where fkidProject = 9) Out of the 3 million rows, if you do a select distinct sgTargetName, how many different target names are there?
The reason I am using a join here is that I wanted it to force using the identifier as the "connector" between the two queries. This query (select fkidTargetName from tblProjectTargets where fkidProject = 9) can return at most around 160 rows depending on the fkidProject. But on average most project(s) will only return around 4 TargetNames. As I mentioned before the query will seem just as slow even with less (say 4 fkidTargetNames), unless I "hardcode" this inner select for example: ('TargetName1', 'TargetName2' ...). There are around 160 distinct (unique) target names out of the 3 million rows. So nothing too large really. I just want to return the most recent row(s) for each target name within a project that are within a certain date/time span. There must be something i'm missing
What version of Mysql are you using? Anything below 5 is going to have a hard time with subselect statements. Usually any query is going to have a hard time, but with versions below 5.0 mysql's query engine doesn't do a good job figuring out the least cost and just makes the query go crazy..
I'm using version 5.0. So basically I'm thinking MySQL is just not up to the task. Unfortunately I don't have much choice with what db I get to work with.