Is this query irreducibly complex?

Question

I have two MySQL database tables, described below. One table holds device information, and the other is a one-to-many log about each device.

CREATE TABLE  `device` (
  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `name` VARCHAR(255) NOT NULL,
  `active` INT NOT NULL DEFAULT 1,
  INDEX (`active`)
);

CREATE TABLE  `log` (
  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `device_id` INT NOT NULL,
  `message` VARCHAR(255) NOT NULL,
  `when` DATETIME NOT NULL,
  INDEX (`device_id`)
);

What I want to do is grab device information along with the latest log entry for each device in a single query (if possible). So far, what I have is the following:

SELECT d.id, d.name, l.message
FROM device AS d
LEFT JOIN (
  SELECT l1.device_id, l1.message
  FROM log AS l1
  LEFT JOIN log AS l2 ON (l1.device_id = l2.device_id AND l1.when < l2.when)
  WHERE l2.device_id IS NULL
) AS l ON (d.id = l.device_id)
WHERE d.active = 1
GROUP BY d.id
ORDER BY d.id ASC;

These queries are simplified reproductions of my actual setup, where my log table has over 100k rows (and there are actually several log tables I look at). The query does run, but very, very slowly (more than two minutes). I'm convinced that there is a more concise/elegant/"SQL" way to form this query to get the data I need, but I just haven't found it yet.

Is what I want to do even possible without the ugly sub-SELECT and self-JOIN? Can I get the job done with a different strategy? Or, is the very nature of the query something that is irreducibly complex?

Again, the application logic is such that I can "manually JOIN" the tables if this isn't going to work, but I feel like MySQL should be able to handle something like this without choking - but I'm admittedly green when it comes to this kind of complex set algebra.

EDIT: As this is a contrived example, I'd forgotten to add the index to device.active

Answer 1:

Here's a slightly different approach to your query that avoids the self-join:

SELECT d.id, d.name, l.message
FROM device AS d
LEFT JOIN (
  SELECT l1.device_id, l1.message
  FROM log AS l1
  WHERE l1.when = (
        SELECT MAX(l2.when)
        FROM log AS l2
        WHERE l2.device_id = l1.device_id
  )
) AS l ON l.device_id = d.id
WHERE d.active = 1
ORDER BY d.id ASC;

Since 100k rows isn't a very large table, even without the proper indexes I wouldn't expect this query to take more than a few seconds. However, as the comments suggest, you might consider adding indexes based on the results of your EXPLAIN plan.

Answer 2:

Here's an alternative that requires only one instance of the log table:

SELECT    d.id, d.name, 
          SUBSTRING_INDEX(
              GROUP_CONCAT(
                  l.message
                  ORDER BY l.when DESC
                  SEPARATOR '~'
              ) 
          ,   '~'
          ,   1
          )
FROM      device d
LEFT JOIN log    l
ON        d.id = l.device_id
WHERE     d.active = 1
GROUP BY  d.id

This query finds the last log message by creating a tilde-separated list of messages, sorted by date in descending order. The GROUP_CONCAT builds that list, and the SUBSTRING_INDEX chips off its first entry.

There are two drawbacks to this approach:

1. It uses GROUP_CONCAT. If the result of that function grows longer than group_concat_max_len, the result is truncated. You can remedy that by running

SET @@group_concat_max_len = @@max_allowed_packet;

before the query. You can do even better than that: since you only want one message, you can set group_concat_max_len to the maximum length of the message column (the setting is measured in bytes, so allow for multi-byte character sets). This will save considerable memory compared to using @@max_allowed_packet.

2. It relies on a special separator (in the example, a tilde, '~') that must not appear within the message text. You can change this to any separator string you like, as long as you're sure it doesn't appear inside the message text.
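To make the separator drawback concrete, here is a minimal Python sketch (not MySQL code) that mimics the GROUP_CONCAT plus SUBSTRING_INDEX pipeline on an in-memory list. The function name latest_by_concat and the sample rows are made up for illustration.

```python
def latest_by_concat(rows, sep="~"):
    """Mimic GROUP_CONCAT(message ORDER BY when DESC SEPARATOR sep)
    followed by SUBSTRING_INDEX(..., sep, 1).

    rows is a list of (when, message) tuples for a single device.
    """
    # ORDER BY when DESC: ISO-formatted timestamps sort correctly as strings.
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)
    # GROUP_CONCAT: glue every message together with the separator.
    joined = sep.join(message for _, message in ordered)
    # SUBSTRING_INDEX(joined, sep, 1): keep everything before the first separator.
    return joined.split(sep, 1)[0]


rows = [
    ("2012-07-30 10:00:00", "booted"),
    ("2012-07-30 11:00:00", "link up"),
]
clean = latest_by_concat(rows)

# A message that contains the separator gets cut short.
tricky = latest_by_concat(rows + [("2012-07-30 12:00:00", "temp ~40C")])

print(repr(clean))
print(repr(tricky))
```

With the first two rows the latest message comes back intact. Once a message contains a tilde, only the part before it survives, which is why the separator has to be something that can never occur in the data.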

If you can live with these limitations, then this query is probably the fastest.

Here are more alternatives that are about as complex as yours but may perform better.

SELECT    d.id
,         d.name
,         l.message
FROM      (
          SELECT    d.id, d.name, MAX(l.when) lmax
          FROM      device d
          LEFT JOIN log    l
          ON        d.id = l.device_id
          WHERE     d.active  = 1
          GROUP BY  d.id
          ) d
LEFT JOIN log       l
ON        d.id   = l.device_id
AND       d.lmax = l.when
ORDER BY d.id ASC;

Another alternative:

SELECT    d.id
,         d.name
,         l2.message
FROM      device d
LEFT JOIN (
          SELECT   l.device_id
          ,        MAX(l.when) lmax
          FROM     log l
          GROUP BY l.device_id
          ) l1
ON        d.id = l1.device_id 
LEFT JOIN log       l2
ON        l1.device_id = l2.device_id
AND       l1.lmax      = l2.when
WHERE     d.active     = 1
ORDER BY  d.id ASC;

Answer 3:

Your query and the strategies below will benefit from an index on log(device_id, when). That index can replace the existing index on log(device_id), which becomes redundant because device_id is the leading column of the new index.

If you have a whole boatload of log entries for each device, the JOIN in your query is going to generate a good-sized intermediate result set, which then gets filtered down to one row per device. I don't believe the MySQL optimizer has any "shortcuts" for that anti-join operation (at least not in 5.1), but your query might still be the most efficient.

Q: Can I get the job done with a different strategy?

Yes, there are other strategies, but I don't know that any of them is "better" than your query.

UPDATE:

One strategy you might consider is adding another table to your schema, one that holds the most recent log entry for each device. This could be maintained by TRIGGERs defined on the log table. If you are only performing inserts (no UPDATE and no DELETE of the most recent log entry), this is fairly straightforward. Whenever an insert is performed against the log table, an AFTER INSERT FOR EACH ROW trigger fires. It compares the when value being inserted for a device_id to the current when value in the log_latest table, and inserts or updates the row in log_latest so that the most recent row is always there. You could also (redundantly) store the device name in that table. (Alternatively, you could add latest_when and latest_message columns to the device table, and maintain them there.)

This strategy goes beyond your original question, but it is workable if you need to frequently run a "latest log message for all devices" query. The downside is that you have an extra table, and a performance hit when performing inserts to the log table. This table could be entirely refreshed using a query like your original one, or the alternatives below.
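As a rough sketch of the trigger idea, the following uses Python's standard-library sqlite3 module as a stand-in so that it runs without a server. This is an assumption made for illustration: MySQL's trigger syntax differs, and there you would more likely write the body around INSERT ... ON DUPLICATE KEY UPDATE. The trigger name log_after_insert and the sample rows are hypothetical; log_latest is the table described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log (
  id        INTEGER PRIMARY KEY AUTOINCREMENT,
  device_id INTEGER NOT NULL,
  message   TEXT NOT NULL,
  "when"    TEXT NOT NULL
);

-- One row per device: its most recent log entry.
CREATE TABLE log_latest (
  device_id INTEGER PRIMARY KEY,
  message   TEXT NOT NULL,
  "when"    TEXT NOT NULL
);

CREATE TRIGGER log_after_insert AFTER INSERT ON log
FOR EACH ROW
BEGIN
  -- Remove the stored entry only if the new one is at least as recent.
  DELETE FROM log_latest
   WHERE device_id = NEW.device_id
     AND "when" <= NEW."when";
  -- Store the new entry unless a more recent one is still in place.
  INSERT INTO log_latest (device_id, message, "when")
  SELECT NEW.device_id, NEW.message, NEW."when"
   WHERE NOT EXISTS (SELECT 1 FROM log_latest
                      WHERE device_id = NEW.device_id);
END;
""")

# Inserts may arrive out of chronological order.
conn.executemany(
    'INSERT INTO log (device_id, message, "when") VALUES (?, ?, ?)',
    [
        (1, "booted", "2012-07-30 10:00:00"),
        (1, "link up", "2012-07-30 11:00:00"),
        (1, "late arrival", "2012-07-30 09:00:00"),
        (2, "booted", "2012-07-30 08:00:00"),
    ],
)

latest = conn.execute(
    "SELECT device_id, message FROM log_latest ORDER BY device_id"
).fetchall()
print(latest)
```

The "latest message for all devices" query then becomes a plain join from device to log_latest, at the cost of the extra work on every insert.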

One approach is a query that does a simple join of the device and log tables and gets the rows ordered by device and by descending when. It then uses a memory variable to process the rows and filter out all but the "latest" log entry. Note that this query returns an extra column. (This extra column can be removed by wrapping the whole query as an inline view, but you'll likely get better performance if you can live with an extra column being returned.)

SELECT IF(s.id = @prev_device_id,0,1) AS latest_flag
     , @prev_device_id := s.id AS id
     , s.name
     , s.message
  FROM (SELECT d.id
             , d.name
             , l.message
          FROM device d
          LEFT
          JOIN log l ON l.device_id = d.id
         WHERE d.active = 1
         ORDER BY d.id, l.when DESC
       ) s
  JOIN (SELECT @prev_device_id := NULL) i
HAVING latest_flag = 1

What the first expression in the SELECT list is doing is "marking" a row whenever the device id value on that row DIFFERS from the device id on the PREVIOUS row. The HAVING clause filters out all of the rows that aren't marked with a 1. (You can omit the HAVING clause to see how that expression works.)
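The marking logic can be sketched outside SQL. This Python fragment is only an illustration of the idea, assuming the rows are already sorted by device id and then by when descending; the function name mark_latest and the sample rows are made up.

```python
def mark_latest(rows):
    """Keep the first row seen for each device id.

    rows: (device_id, name, message) tuples, pre-sorted by device_id and
    then by when DESC, like the inline view aliased as s.
    """
    prev_device_id = None  # plays the role of @prev_device_id
    kept = []
    for device_id, name, message in rows:
        # IF(s.id = @prev_device_id, 0, 1): flag a change of device.
        latest_flag = 0 if device_id == prev_device_id else 1
        # @prev_device_id := s.id
        prev_device_id = device_id
        # HAVING latest_flag = 1
        if latest_flag == 1:
            kept.append((device_id, name, message))
    return kept


rows = [
    (1, "router", "link down"),  # newest entry for device 1
    (1, "router", "link up"),
    (1, "router", "booted"),
    (2, "switch", None),         # device without log entries (LEFT JOIN)
    (3, "printer", "paper jam"),
]
result = mark_latest(rows)
print(result)
```

Everything depends on the input order: if the rows for one device are not contiguous, or not newest first, the wrong row is kept.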

(I didn't test this for syntax errors. If you get an error, let me know, and I will take a closer look. My desk checking says it's fine, but it's possible I missed a paren or comma.)

You can "get rid of" that extra column by wrapping the query in another query:

SELECT r.id,r.name,r.message FROM (
/* query from above */
) r

(But again, this may impact performance; you'd likely get better performance if you can live with the extra column.)

Of course, add an ORDER BY on the outermost query to guarantee that your resultset is ordered the way you need it.

This approach would work fairly well for a large number of devices with only a couple of related rows each in log. Otherwise, it is going to generate a huge intermediate result set (on the order of the number of rows in the log table), which will have to be spun out to a temporary MyISAM table.

UPDATE:

If you are getting essentially all of the rows from device (where the predicate is not very selective), you can probably get better performance by getting the latest log entry for every device_id in the log table, and postpone the join to the device table. (But note that an index will not be available on that intermediate result set to do the join, so it would really need to be tested to gauge performance.)

SELECT d.id
     , d.name
     , t.message
  FROM device d 
  LEFT
  JOIN (SELECT IF(s.device_id = @prev_device_id,0,1) AS latest_flag
             , @prev_device_id := s.device_id AS device_id
             , s.message
          FROM (SELECT l.device_id
                     , l.message
                  FROM log l
                 ORDER BY l.device_id DESC, l.when DESC
               ) s
          JOIN (SELECT @prev_device_id := NULL) i
        HAVING latest_flag = 1
       ) t
    ON t.device_id = d.id

NOTE: We specify descending order on both the device_id and when columns in the ORDER BY clause of the inline view aliased as s, not because we need the rows in descending device_id order, but so that MySQL can avoid a filesort operation by performing a "reverse scan" of an index with leading columns (device_id, when).

NOTE: This query is still going to spool off its intermediate result sets as temporary MyISAM tables, and there won't be any indexes on those. So it's likely this won't perform as well as your original query.

Another strategy is to use a correlated subquery in the SELECT list. You are only returning a single column from the log table, so this is a fairly easy query to understand:

SELECT d.id
     , d.name
     , ( SELECT l.message
           FROM log l
          WHERE l.device_id = d.id
          ORDER BY l.when DESC 
          LIMIT 1
       ) AS message
  FROM device d
 WHERE d.active = 1
 ORDER BY d.id ASC;

NOTE: Since id is the PRIMARY KEY (or a UNIQUE KEY) in the device table, and because you aren't doing any JOIN that will generate extra rows, you can omit the GROUP BY clause.

NOTE: This query will use a "nested loops" operation. That is, for each row returned from the device table, (essentially) a separate query needs to be run to get the related row from log. For only a few device rows (as would be returned with a more selective predicate on the device table), and with a boatload of log entries for each device, performance will not be too bad. But for a lot of devices which each have only a few log messages, other approaches are very likely going to be much more efficient.

Also note that you can easily extend this approach to return the second latest log message as a separate column: add another subquery (just like the first one) to the SELECT list, and change the LIMIT clause to skip the first row and get the second row instead.

     , ( SELECT l.message
           FROM log l
          WHERE l.device_id = d.id
          ORDER BY l.when DESC 
          LIMIT 1,1
       ) AS message_2
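To see both subqueries side by side, here is a small runnable sketch using Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption made so the example needs no server). The sample devices and messages are made up, the column types are simplified, and LIMIT 1 OFFSET 1 is used as the equivalent of LIMIT 1,1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device (
  id     INTEGER PRIMARY KEY,
  name   TEXT NOT NULL,
  active INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE log (
  id        INTEGER PRIMARY KEY,
  device_id INTEGER NOT NULL,
  message   TEXT NOT NULL,
  "when"    TEXT NOT NULL
);
-- The composite index recommended at the top of this answer.
CREATE INDEX log_device_when ON log (device_id, "when");

INSERT INTO device (id, name, active) VALUES
  (1, 'router', 1), (2, 'switch', 1), (3, 'retired', 0);
INSERT INTO log (device_id, message, "when") VALUES
  (1, 'booted',    '2012-07-30 10:00:00'),
  (1, 'link up',   '2012-07-30 11:00:00'),
  (1, 'link down', '2012-07-30 12:00:00');
""")

rows = conn.execute("""
SELECT d.id
     , d.name
     , ( SELECT l.message
           FROM log l
          WHERE l.device_id = d.id
          ORDER BY l."when" DESC
          LIMIT 1
       ) AS message
     , ( SELECT l.message
           FROM log l
          WHERE l.device_id = d.id
          ORDER BY l."when" DESC
          LIMIT 1 OFFSET 1
       ) AS message_2
  FROM device d
 WHERE d.active = 1
 ORDER BY d.id ASC
""").fetchall()

for row in rows:
    print(row)
```

A device with no log entries still appears, with NULL in both message columns, which matches the behavior of the LEFT JOIN variants.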

For getting basically all the rows from device, you'll likely get the best performance using JOIN operations. The one drawback of this approach is that it has the potential to return multiple rows for a device, when there are two (or more) rows that have a matching latest when value for a device. (Basically, this approach is guaranteed to return a "correct" resultset only when we have a guarantee that log(device_id,when) is unique.)

With this query as an inline view, to get the "latest" when value:

SELECT l.device_id
     , MAX(l.when)
  FROM log l
 GROUP BY l.device_id

We can join this to the log and device tables.

SELECT d.id
     , d.name
     , m.message
  FROM device d
  LEFT
  JOIN (
         SELECT l.device_id
              , MAX(l.when) AS `when`
           FROM log l
          GROUP BY l.device_id 
       ) k
    ON k.device_id = d.id
  LEFT
  JOIN log m 
    ON m.device_id = d.id
       AND m.device_id = k.device_id
       AND m.when = k.when
 ORDER BY d.id
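The duplicate-row drawback mentioned earlier is easy to reproduce. The sketch below again uses Python's standard-library sqlite3 module as a stand-in for MySQL, with made-up sample data in which two log rows for the same device share the latest when value. The alias latest and the extra m.id sort key are additions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE log (
  id        INTEGER PRIMARY KEY,
  device_id INTEGER NOT NULL,
  message   TEXT NOT NULL,
  "when"    TEXT NOT NULL
);

INSERT INTO device (id, name) VALUES (1, 'router'), (2, 'switch');
INSERT INTO log (device_id, message, "when") VALUES
  (1, 'booted',     '2012-07-30 10:00:00'),
  (1, 'link up',    '2012-07-30 11:00:00'),
  (1, 'dhcp lease', '2012-07-30 11:00:00');  -- same timestamp as 'link up'
""")

rows = conn.execute("""
SELECT d.id
     , d.name
     , m.message
  FROM device d
  LEFT
  JOIN ( SELECT l.device_id
              , MAX(l."when") AS latest
           FROM log l
          GROUP BY l.device_id
       ) k
    ON k.device_id = d.id
  LEFT
  JOIN log m
    ON m.device_id = k.device_id
   AND m."when" = k.latest
 ORDER BY d.id, m.id
""").fetchall()

for row in rows:
    print(row)
```

Device 1 comes back twice because both of its newest entries match the MAX value. If ties are possible in your data, you need a tie-breaker (for example the larger log id) or a uniqueness guarantee on (device_id, when).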

All of these are alternate strategies (which I believe is the question you asked), but I'm not sure any of them is going to be better for your particular needs. (But it's always good to have a couple of different tools in the tool belt, to use as appropriate.)

Source: https://stackoverflow.com/questions/11729820/is-this-query-irreducibly-complex

Tags

mysql

sql

join

self-join