MySQL SELECT most frequent by group

盖世英雄少女心 2020-12-01 22:08

How do I get the most frequently occurring category for each tag in MySQL? Ideally, I would want to simulate an aggregate function that would calculate the mode of a column.

5 answers
  • 2020-12-01 22:18

    Here's a hacky approach that uses the max aggregate function, since MySQL has no mode aggregate function (and, before version 8.0, no window functions) that would allow this directly:

    SELECT  
      tag, 
      convert(substring(max(concat(lpad(c, 20, '0'), category)), 21), signed) 
            AS most_frequent_category 
    FROM (
        SELECT tag, category, count(*) AS c
        FROM tags INNER JOIN stuff using (id) 
        GROUP BY tag, category
    ) as grouped_cats 
    GROUP BY tag;
    

    It works because zero-padding each count to a fixed width makes lexical order agree with numeric order, so the lexical max of the concatenated string belongs to the category with the highest count.
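To see why the padding matters, here is a minimal Python sketch of the same comparison. The rows and the helper name `padded_key` are made up for illustration; only the pad-concatenate-max-substring idea comes from the query above.

```python
# Hypothetical (category, count) pairs for one tag, mirroring the grouped subquery.
rows = [("cat-8", 9), ("cat-10", 12), ("cat-9", 2)]

def padded_key(category, count, width=20):
    # Same idea as concat(lpad(c, 20, '0'), category) in the query.
    return str(count).rjust(width, "0") + category

# Without padding, string comparison ranks "9..." above "12...".
unpadded_max = max(str(count) + category for category, count in rows)

# With padding, the lexical max is also the numeric max.
padded_max = max(padded_key(category, count) for category, count in rows)

# Equivalent of substring(..., 21): drop the 20-character count prefix.
most_frequent = padded_max[20:]

print(unpadded_max)   # 9cat-8
print(most_frequent)  # cat-10
```

When two categories have the same count, the comparison falls through to the category text, so the lexically greater category wins the tie.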

    This is easier to see with named categories:

    create temporary table tags (id int auto_increment primary key, tag character varying(20));
    create temporary table stuff (id int, category character varying(20));
    insert into tags (tag) values ('automotive'), ('ba'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('bamboo'), ('banana tree'), ('banana tree'), ('banana tree'), ('banana tree'), ('bath');
    insert into stuff (id, category) values (1, 'cat-8'), (2, 'cat-8'), (3, 'cat-8'), (4, 'cat-8'), (5, 'cat-8'), (6, 'cat-8'), (7, 'cat-8'), (8, 'cat-10'), (9, 'cat-8'), (10, 'cat-9'), (11, 'cat-8'), (12, 'cat-10'), (13, 'cat-8'), (14, 'cat-9'), (15, 'cat-8'), (16, 'cat-8'), (17, 'cat-8'), (18, 'cat-8'), (19, 'cat-8'), (20, 'cat-9');
    

    With named categories, we shouldn't do the integer conversion on the most_frequent_category column:

    SELECT 
      tag, 
      substring(max(concat(lpad(c, 20, '0'), category)), 21) AS most_frequent_category 
    FROM (
        SELECT tag, category, count(*) AS c
        FROM tags INNER JOIN stuff using (id) 
        GROUP BY tag, category
    ) as grouped_cats 
    GROUP BY tag;
    
    +-------------+------------------------+
    | tag         | most_frequent_category |
    +-------------+------------------------+
    | automotive  | cat-8                  |
    | ba          | cat-8                  |
    | bamboo      | cat-8                  |
    | banana tree | cat-8                  |
    | bath        | cat-9                  |
    +-------------+------------------------+
    

    And to delve a little bit more into what is going on, here's what the grouped_cats inner select looks like (I've added order by tag, c desc):

    +-------------+----------+---+
    | tag         | category | c |
    +-------------+----------+---+
    | automotive  | cat-8    | 1 |
    | ba          | cat-8    | 1 |
    | bamboo      | cat-8    | 9 |
    | bamboo      | cat-10   | 2 |
    | bamboo      | cat-9    | 2 |
    | banana tree | cat-8    | 4 |
    | bath        | cat-9    | 1 |
    +-------------+----------+---+
    

    And we can see how the max of the count(*) column drags along its associated category if we omit the substring bit:

    SELECT 
      tag, 
      max(concat(lpad(c, 20, '0'), category)) AS xmost_frequent_category
    FROM (
        SELECT tag, category, count(*) AS c
        FROM tags INNER JOIN stuff using (id) 
        GROUP BY tag, category
    ) as grouped_cats 
    GROUP BY tag;
    
    +-------------+---------------------------+
    | tag         | xmost_frequent_category   |
    +-------------+---------------------------+
    | automotive  | 00000000000000000001cat-8 |
    | ba          | 00000000000000000001cat-8 |
    | bamboo      | 00000000000000000009cat-8 |
    | banana tree | 00000000000000000004cat-8 |
    | bath        | 00000000000000000001cat-9 |
    +-------------+---------------------------+
    
  • 2020-12-01 22:20
    SELECT  tag, category
    FROM    (
            SELECT  @tag <> tag AS _new,
                    @tag := tag AS tag,
                    category, COUNT(*) AS cnt
            FROM    (
                    SELECT  @tag := ''
                    ) vars,
                    stuff
            GROUP BY
                    tag, category
            ORDER BY
                    tag, cnt DESC
            ) q
    WHERE   _new
    

    On your data, this returns the following:

    'automotive',  8
    'ba',          8
    'bamboo',      8
    'bananatree',  8
    'bath',        9
    

    Here's the test script:

    CREATE TABLE stuff (tag VARCHAR(20) NOT NULL, category INT NOT NULL);
    
    INSERT
    INTO    stuff
    VALUES
    ('automotive',8),
    ('ba',8),
    ('bamboo',8),
    ('bamboo',8),
    ('bamboo',8),
    ('bamboo',8),
    ('bamboo',8),
    ('bamboo',10),
    ('bamboo',8),
    ('bamboo',9),
    ('bamboo',8),
    ('bamboo',10),
    ('bamboo',8),
    ('bamboo',9),
    ('bamboo',8),
    ('bananatree',8),
    ('bananatree',8),
    ('bananatree',8),
    ('bananatree',8),
    ('bath',9);
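The query above amounts to: count per (tag, category), order by tag and descending count, then keep the first row of each tag. A rough Python sketch of those same steps on the test data follows. It is only an illustration of the logic, not of how MySQL evaluates user variables, and the variable names are made up.

```python
from collections import Counter
from itertools import groupby

# (tag, category) rows copied from the test script above.
stuff = (
    [("automotive", 8), ("ba", 8)]
    + [("bamboo", c) for c in (8, 8, 8, 8, 8, 10, 8, 9, 8, 10, 8, 9, 8)]
    + [("bananatree", 8)] * 4
    + [("bath", 9)]
)

# GROUP BY tag, category with COUNT(*) AS cnt.
counts = Counter(stuff)

# ORDER BY tag, cnt DESC.
ordered = sorted(counts.items(), key=lambda item: (item[0][0], -item[1]))

# Keep only the first row of each tag, which is what the _new flag selects.
result = [
    (tag, next(group)[0][1])
    for tag, group in groupby(ordered, key=lambda item: item[0][0])
]

print(result)
```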
    
  • 2020-12-01 22:25

    This is for simpler situations:

    SELECT action, COUNT(action) AS ActionCount FROM log GROUP BY action ORDER BY ActionCount DESC;

  • 2020-12-01 22:29

    (Edit: forgot DESC in ORDER BYs)

    This is easy to do with a LIMIT in the subquery. Does MySQL still have the no-LIMIT-in-subqueries restriction? The example below uses PostgreSQL.

    => select tag,
              (select category from stuff z
               where z.tag = s.tag
               group by tag, category
               order by count(*) DESC limit 1) AS category,
              (select count(*) from stuff z
               where z.tag = s.tag
               group by tag, category
               order by count(*) DESC limit 1) AS num_items
       from stuff s
       group by tag;
        tag     | category | num_items 
    ------------+----------+-----------
     ba         |        8 |         1
     automotive |        8 |         1
     bananatree |        8 |         4
     bath       |        9 |         1
     bamboo     |        8 |         9
    (5 rows)
    

    The third column is only necessary if you need the count.

  • 2020-12-01 22:37
    SELECT t1.*
    FROM (SELECT tag, category, COUNT(*) AS count
          FROM tags INNER JOIN stuff USING (id)
          GROUP BY tag, category) t1
    LEFT OUTER JOIN 
         (SELECT tag, category, COUNT(*) AS count
          FROM tags INNER JOIN stuff USING (id)
          GROUP BY tag, category) t2
      ON (t1.tag = t2.tag AND (t1.count < t2.count 
          OR t1.count = t2.count AND t1.category < t2.category))
    WHERE t2.tag IS NULL
    ORDER BY t1.count DESC;
    

    I agree this is kind of too much for a single SQL query. Any use of GROUP BY inside a subquery makes me wince. You can make it look simpler by using views:

    CREATE VIEW count_per_category AS
        SELECT tag, category, COUNT(*) AS count
        FROM tags INNER JOIN stuff USING (id)
        GROUP BY tag, category;
    
    SELECT t1.*
    FROM count_per_category t1
    LEFT OUTER JOIN count_per_category t2
      ON (t1.tag = t2.tag AND (t1.count < t2.count 
          OR t1.count = t2.count AND t1.category < t2.category))
    WHERE t2.tag IS NULL
    ORDER BY t1.count DESC;
    

    But it's basically doing the same work behind the scenes.

    You commented that you could do a similar operation easily in application code. So why don't you do that? Run the simpler query to get the counts per category:

    SELECT tag, category, COUNT(*) AS count
    FROM tags INNER JOIN stuff USING (id)
    GROUP BY tag, category;
    

    And sort through the result in application code.
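As a sketch of what that application code might look like, here is one pass over the rows in Python. The function name `most_frequent_by_tag` is made up, and the rows are hard-coded from the grouped counts shown earlier in the thread instead of being fetched from a database.

```python
# Rows as the counts-per-category query would return them: (tag, category, count).
rows = [
    ("automotive", "cat-8", 1),
    ("ba", "cat-8", 1),
    ("bamboo", "cat-8", 9),
    ("bamboo", "cat-10", 2),
    ("bamboo", "cat-9", 2),
    ("banana tree", "cat-8", 4),
    ("bath", "cat-9", 1),
]

def most_frequent_by_tag(rows):
    best = {}  # tag -> (count, category)
    for tag, category, count in rows:
        candidate = (count, category)
        # Higher count wins; on equal counts the greater category wins,
        # matching the tie-break in the join condition above.
        if tag not in best or candidate > best[tag]:
            best[tag] = candidate
    return {tag: category for tag, (count, category) in best.items()}

print(most_frequent_by_tag(rows))
```

Comparing `(count, category)` tuples keeps the result deterministic when two categories tie, which a plain "first row seen" approach would not guarantee.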
