SQL query based on subquery. Retrieve transactions with data > threshold

Question

My database table is called transactions and looks like this:

 Name    | Date                | Type    | Stock     | Volume | Price | Total
 varchar | DateTime            | varchar | varchar   | int    | float | float
 --------+---------------------+---------+-----------+--------+-------+-------
 Tom     | 2014-05-24 12:00:00 | Sell    | Barclays  | 100    | 2.2   | 220.0
 Bob     | 2014-04-13 15:00:00 | Buy     | Coca-Cola | 10     | 12.0  | 120.0

My initial problem was to remove from the table ALL the transactions that belong to a user whose first transaction is later than a certain threshold. My query was:

DELETE FROM transactions WHERE name NOT IN (SELECT name FROM transactions2 WHERE date < CAST('2014-01-01 12:00:00.000' as DateTime));
Query OK, 35850 rows affected (3 hours 5 min 28.88 sec)

I think this is a poor solution: I had to duplicate the table to avoid deleting from the same table I am reading from, and the execution took quite a long time (3 hours for a table containing ~170k rows).

Now I am trying to delete ALL the transactions that belong to a user whose latest transaction happened before a certain threshold date.

DELETE FROM transactions WHERE name IN (SELECT name FROM transactions HAVING max(date) < CAST('2015-01-01 12:00:00.000' as DateTime) );

Sadly, the subquery finds only one result:

SELECT name FROM transactions HAVING max(date) < CAST('2015-01-01 12:00:00.000' as DateTime);

+------------+
| name       |
+------------+
| david      |
+------------+

I guess I am getting only one result because of the max() function. I am not an SQL expert, but I understand quite well what I need in terms of sets and logic. I would be really happy to get suggestions on how to rewrite my query.

EDIT: Here is a sqlfiddle with the schema and some data: http://sqlfiddle.com/#!2/389ede/2

I need to remove ALL the entries for alex, because his last transaction happened before a certain threshold (let's say 1 Jan 2013). I don't need to delete tom's transactions, because his latest one is later than 1 Jan 2013.

Answer 1:

Your first query can be formulated as: delete a user's transactions if no transaction exists for that user before ?. This is easy to translate into SQL:

delete from transactions t1
where not exists (
    select 1 from transactions t2
    where t1.name = t2.name
      and t2.date < ?
)

MySQL still does not support (AFAIK) deleting from a table that is also referenced in a subquery of the same statement, so we need to rewrite it as a multi-table delete with a left join:

delete t1.* 
from transactions t1
left join transactions t2
    on t1.name = t2.name
   and t2.date < ?
where t2.name is null

date is a keyword in MySQL, so it is safest to quote it with backticks.

Your second query can be solved the same way: delete from transactions where no transaction exists for that user after a certain date. I'll leave it as an exercise.
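For readers who want to compare notes, here is one way the second case might look with the same left-join pattern. It is only a sketch: it assumes the transactions table from the question, that name is never NULL, and the threshold '2015-01-01 12:00:00' from the question's second query. It relies on the fact that "the latest transaction is before the threshold" means the same as "no transaction exists at or after the threshold".

```sql
-- Delete every transaction of users who have no transaction
-- at or after the threshold (i.e. whose latest one is older than it).
DELETE t1
FROM transactions t1
LEFT JOIN transactions t2
       ON t1.name = t2.name
      AND t2.`date` >= '2015-01-01 12:00:00'  -- a row at or after the threshold
WHERE t2.name IS NULL;                         -- no such row exists for this user
```

Replacing `DELETE t1` with `SELECT t1.*` gives a preview of the rows that would be removed, which is a cheap check before running the delete.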

Answer 2:

Alvin, here is a simplified scenario from your fiddle with dates:

CREATE TABLE transactions
(    id    int(11) NOT NULL AUTO_INCREMENT
,    name  varchar(30) NOT NULL
,    value datetime NOT NULL
,    PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO transactions (name, value) VALUES ('alex',  '2011-01-01 12:00:00')
                                           ,  ('alex',  '2012-06-01 12:00:00');

Let's investigate what happens in:

SELECT t1.name as t1_name, t1.value as t1_value
     , t2.name as t2_name, t2.value as t2_value
FROM transactions t1
LEFT JOIN transactions t2
    ON t1.name = t2.name

T1_NAME  T1_VALUE                        T2_NAME  T2_VALUE
alex     January, 01 2011 12:00:00+0000  alex     January, 01 2011 12:00:00+0000
alex     January, 01 2011 12:00:00+0000  alex     June, 01 2012 12:00:00+0000
alex     June, 01 2012 12:00:00+0000     alex     January, 01 2011 12:00:00+0000
alex     June, 01 2012 12:00:00+0000     alex     June, 01 2012 12:00:00+0000

That is, 4 rows: every alex row in t1 paired with every alex row in t2. If we now add a date condition to the join predicate:

SELECT t1.name as t1_name, t1.value as t1_value
     , t2.name as t2_name, t2.value as t2_value
FROM transactions t1
LEFT JOIN transactions t2
    ON t1.name = t2.name    
   AND t2.value > CAST('2011-06-01 12:00.000' as DateTime)

This leaves us with two rows. If we change the time to '2012-06-01 12:00.000' we still have two rows due to the left join, but the t2 columns will be null.

If we now add the WHERE clause:

SELECT t1.name as t1_name, t1.value as t1_value
     , t2.name as t2_name, t2.value as t2_value
FROM transactions t1
LEFT JOIN transactions t2
    ON t1.name = t2.name    
   AND t2.value > CAST('2012-06-01 12:00.000' as DateTime)
WHERE t2.name is null;

we still have two rows. With CAST('2011-06-01 12:00.000' as DateTime) there are no rows.

Remember that this construction is equivalent to:

SELECT t1.name as t1_name, t1.value as t1_value
FROM transactions t1
WHERE NOT EXISTS (
    SELECT 1 FROM transactions t2
    WHERE t1.name = t2.name    
      AND t2.value > CAST('2012-06-01 12:00.000' as DateTime)
);

So, if no row exists for the name with value > '2012-06-01 12:00.000', we have a match. Does that clarify?
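To connect this back to the delete, the last left-join SELECT can be turned into a multi-table DELETE. This is a sketch against the simplified two-row table above, not the real transactions table, and it writes the threshold as '2012-06-01 12:00:00'.

```sql
-- Same join and filter as the SELECT with "WHERE t2.name IS NULL",
-- but deleting the matching t1 rows instead of listing them.
DELETE t1
FROM transactions t1
LEFT JOIN transactions t2
       ON t1.name = t2.name
      AND t2.value > CAST('2012-06-01 12:00:00' AS DATETIME)
WHERE t2.name IS NULL;   -- keep only t1 rows with no newer partner row
```

With the two alex rows inserted above, neither has a value later than the threshold, so both rows qualify, matching the two rows the SELECT returned.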

Answer 3:

@Lennart, Alvin, consider the following...

DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,val INT NOT NULL);

INSERT INTO my_table (val) VALUES (1),(1),(2),(1),(3),(2),(3),(1),(4);

SELECT * FROM my_table;
+----+-----+
| id | val |
+----+-----+
|  1 |   1 |
|  2 |   1 |
|  3 |   2 |
|  4 |   1 |
|  5 |   3 |
|  6 |   2 |
|  7 |   3 |
|  8 |   1 |
|  9 |   4 |
+----+-----+

Let's delete the most recent result for each val, i.e. the result of...

SELECT x.* 
  FROM my_table x 
  JOIN 
     ( SELECT val, max(id) max_id FROM my_table GROUP BY val ) y 
    ON y.val = x.val 
   AND y.max_id = x.id;
+----+-----+
| id | val |
+----+-----+
|  8 |   1 |
|  6 |   2 |
|  7 |   3 |
|  9 |   4 |
+----+-----+

So...

DELETE x 
  FROM my_table x 
  JOIN ( SELECT val, max(id) max_id FROM my_table GROUP BY val ) y 
    ON y.val = x.val 
   AND y.max_id = x.id;

SELECT * FROM my_table;
+----+-----+
| id | val |
+----+-----+
|  1 |   1 |
|  2 |   1 |
|  3 |   2 |
|  4 |   1 |
|  5 |   3 |
+----+-----+
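The same derived-table pattern could be applied to the question's table. The following is a sketch only, assuming the transactions table and the threshold '2015-01-01 12:00:00' from the question; the aliases x and y simply mirror the example above. It also shows why the question's subquery returned a single name: without GROUP BY, max(date) is computed once over the whole table rather than once per user.

```sql
-- Users whose latest transaction is older than the threshold.
-- GROUP BY name makes MAX() evaluate once per user.
SELECT name
  FROM transactions
 GROUP BY name
HAVING MAX(`date`) < '2015-01-01 12:00:00';

-- Delete all transactions of those users, joining against the
-- grouped result as a derived table (as in the my_table example).
DELETE x
  FROM transactions x
  JOIN ( SELECT name
           FROM transactions
          GROUP BY name
         HAVING MAX(`date`) < '2015-01-01 12:00:00' ) y
    ON y.name = x.name;
```

Running the SELECT on its own first shows which users would be affected before anything is deleted.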

Source: https://stackoverflow.com/questions/23935467/sql-query-based-on-subquery-retrieve-transactions-with-data-threshold

Tags

mysql

sql

datetime

subquery

conditional-statements