Hello everyone.. this is my first post.. I hope I don't do/say something stupid. I have a database which contains tables: item, tag, and item2tag, like so: item - id - description (varchar) - popularity (integer) tag - id - name (varchar) item2tag - id - item_id - tag_id - weight (float) The weights of an item's tags will add up to 1. For example, if an item has 5 entries in item2tag, the sum of their weights will be 1. Given an item.id, I'd like to find the 50 most similar items to it based on tags and their weights. I've gotten a query to work, but as the database grows, it's getting slower and slower: SELECT matcheditem.*, SUM(LEAST(matchedtags.weight, mytags.weight)) as similarity FROM item2tag as mytags, item as matcheditem, item2tag as matchedtags WHERE mytags.item_id=12345 AND mytags.tag_id=matchedtags.tag_id AND matchedtags.item_id=matcheditem.id GROUP BY matchedtags.item_id ORDER BY similarity DESC LIMIT 50 This query is taking a few seconds now, and is getting worse and worse as more items are added. All id and _id fields are indexed. There are about 40,000 items, and item2tag contains about 1,000,000 rows. I'm willing to run a larger query(s) or procedure that will create a table with 50 results for each entry, but would prefer a faster query. Any thoughts?
Make sure you have indexes on these fields: item2tag - id - item_id (... index) - tag_id (... index) - weight (float) It will take a while to add an index to a million row table, but this should speed up the query you provided.
I have indexes on those fields, queries are still taking 100 seconds! Would doing a sub-select help at all?