Query help on a self join for finding similar items

Discussion in 'Databases' started by devdude, Jul 29, 2008.

  1. #1
    Hello everyone.. this is my first post.. I hope I don't do/say something stupid.

    I have a database which contains tables: item, tag, and item2tag, like so:

    item
    - id
    - description (varchar)
    - popularity (integer)

    tag
    - id
    - name (varchar)

    item2tag
    - id
    - item_id
    - tag_id
    - weight (float)

    The weights of an item's tags will add up to 1. For example, if an item has 5 entries in item2tag, the sum of their weights will be 1.

    Given an item.id, I'd like to find the 50 most similar items to it based on tags and their weights. I've gotten a query to work, but as the database grows, it's getting slower and slower:

    SELECT matcheditem.*,
    SUM(LEAST(matchedtags.weight, mytags.weight)) as similarity
    FROM
    item2tag as mytags,
    item as matcheditem,
    item2tag as matchedtags
    WHERE
    mytags.item_id=12345
    AND mytags.tag_id=matchedtags.tag_id
    AND matchedtags.item_id=matcheditem.id
    GROUP BY matchedtags.item_id
    ORDER BY similarity DESC
    LIMIT 50

    This query is taking a few seconds now, and is getting worse and worse as more items are added. All id and _id fields are indexed. There are about 40,000 items, and item2tag contains about 1,000,000 rows.

    I'm willing to run a larger query(s) or procedure that will create a table with 50 results for each entry, but would prefer a faster query.

    Any thoughts?
     
    devdude, Jul 29, 2008 IP
  2. jestep

    jestep Prominent Member

    Messages:
    3,659
    Likes Received:
    215
    Best Answers:
    19
    Trophy Points:
    330
    #2
    Make sure you have indexes on these fields:
    item2tag
    - id
    - item_id (... index)
    - tag_id (... index)
    - weight (float)

    It will take a while to add an index to a million row table, but this should speed up the query you provided.
     
    jestep, Jul 30, 2008 IP
  3. devdude

    devdude Peon

    Messages:
    2
    Likes Received:
    0
    Best Answers:
    0
    Trophy Points:
    0
    #3
    I have indexes on those fields, queries are still taking 100 seconds!

    Would doing a sub-select help at all?
     
    devdude, Jul 30, 2008 IP