MySQL tagging question: how to select an item that has been tagged as X, Y, and Z?

我的梦境 提交于 2019-12-22 09:55:36

问题


I'm dealing with a database where items are "tagged" a certain number of times.

item (100k rows)

  • id
  • name
  • other stuff

tag (10k rows)

  • id
  • name

item2tag (1,000,000 rows)

  • item_id
  • tag_id
  • count

I'm looking for the fastest solution to:

Select items that have been tagged as X, Y, and Z (where X, Y, and Z correspond to (possibly) tag names) ?

Here's what I have so far... I'd just like to make sure I'm doing it in the best way possible:

First get the tag_ids from the names:

SELECT tag.id WHERE name IN ("X","Y","Z");

Then I group by those tag_ids and use Having to filter the result:

SELECT item2tag.*, count(tag_id)
  FROM item2tag
  WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;

Then I can just select from item with those ids.

SELECT * FROM item WHERE id IN ([results from prior query])

I have millions of rows in item2tag, with an index on (item_id, tag_id). Is this going to be the fastest solution?


回答1:


The method you have suggested is probably the most common way to perform the query but might not be the fastest. Using joins can be faster:

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3

You should ensure that you have the following indexes:

  • Primary key on (item_id, tag_id)
  • Index on (tag_id).

I performance tested this query against the original in a few different scenarios.

  • For the case where nearly every item in the table is tagged with at least one of the tags being searched for, the original query takes about 5 seconds and the JOIN version takes about 10 seconds - slightly slower.
  • For the case where two of the tags occur very frequently and one of the tags occurs only very rarely the original query takes about 0.9 seconds, whereas the JOIN query takes just 0.003 seconds - a considerable performance improvement.

The SQL I used to make performance test is pasted below. You can run this test yourself or modify it slightly and test other queries, or different scenarios.

Warning: Don't run this script on your production database as it modifies the contents of the item2tag table. Running the script can take a few minutes as it creates a lot of data.

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$
CALL prc_filler(1000000);

CREATE TABLE item2tag (
    item_id INT NOT NULL,
    tag_id INT NOT NULL,
    count INT NOT NULL
);

INSERT INTO item2tag (item_id, tag_id, count)
SELECT  id % 150001, id % 10, 1
FROM    filler;
ALTER TABLE item2tag ADD PRIMARY KEY (item_id, tag_id);
ALTER TABLE item2tag ADD KEY (tag_id);

-- Make tag 3 occur rarely.    
UPDATE item2tag SET tag_id = 10 WHERE tag_id = 3 AND item_id > 0;

SELECT T1.item_id
FROM item2tag T1
JOIN item2tag T2 ON T1.item_id = T2.item_id
JOIN item2tag T3 ON T2.item_id = T3.item_id
WHERE T1.tag_id = 1 AND T2.tag_id = 2 AND T3.tag_id = 3;

SELECT item_id
FROM item2tag
WHERE tag_id=1 or tag_id=2 or tag_id=3
GROUP BY item_id
HAVING count(tag_id)=3;



回答2:


You'll be better placed having an index that has tag_id as the first column - otherwise finding all items with tag_id 1 will require a full table scan (same for any tag_id, of course).




回答3:


Depending on how many items are tagged with individual tags, you might do it by getting list of items tagged with one tag, and then filtering it for occurences of other tags, like this:

select item_id from item2tag
where item_id in (
    select item_id from item2tag
    where item_id in (
        select item_id from item2tag where tag_id = TID1
    ) and tag_id = TID2
) and tag_id = TID3


来源:https://stackoverflow.com/questions/3260543/mysql-tagging-question-how-to-select-an-item-that-has-been-tagged-as-x-y-and

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!