Optimize MySQL self-join query

问题

I have c_regs table that contains duplicate rows. I've created index on form_number and property_name columns. Unfortunately this query still taking to-o-o-o long to complete, especially with addition of t10 and t11 joins. Is there a way to optimize it? Thanks.

select 
    ifnull(x.form_datetime,'') reg_date,
    ifnull(x.property_value,'') amg_id,
    x.form_number,
    x.form_name,
    x.form_version,
    ifnull(t1.property_value,'') first_name,
    ifnull(t2.property_value,'') last_name,
    ifnull(t3.property_value,'') address, 
    ifnull(t4.property_value,'') address_2,
    ifnull(t5.property_value,'') city,
    ifnull(t6.property_value,'') state_code,
    ifnull(t7.property_value,'') zip,
    ifnull(t8.property_value,'') phone,
    ifnull(t9.property_value,'') email,
    ifnull(t10.property_value,'') registrant_type,
    t11.property_value auth_type_code
from 
    (select distinct form_datetime, form_number, form_name, form_version, property_value  from c_regs where property_name = 'field.frm_personID') as x
    inner join (select distinct * from c_regs) as t1 on t1.form_number = x.form_number and t1.property_name = 'field.frm_firstName'
    inner join (select distinct * from c_regs) as t2 on t2.form_number = x.form_number and t2.property_name = 'field.frm_lastName'
    inner join (select distinct * from c_regs) as t3 on t3.form_number = x.form_number and t3.property_name = 'field.frm_address'
    left join (select distinct * from c_regs) as t4 on t4.form_number = x.form_number and t4.property_name = 'field.frm_address2'
    inner join (select distinct * from c_regs) as t5 on t5.form_number = x.form_number and t5.property_name = 'field.frm_city'
    inner join (select distinct * from c_regs) as t6 on t6.form_number = x.form_number and t6.property_name = 'field.frm_state'
    inner join (select distinct * from c_regs) as t7 on t7.form_number = x.form_number and t7.property_name = 'field.frm_zip'
    inner join (select distinct * from c_regs) as t8 on t8.form_number = x.form_number and t8.property_name = 'field.frm_phone'
    inner join (select distinct * from c_regs) as t9 on t9.form_number = x.form_number and t9.property_name = 'field.frm_emailAddress'
    left join (select distinct * from c_regs) as t10 on t10.form_number = x.form_number and t10.property_name = 'field.frm_youAre'
    inner join (select distinct * from c_regs) as t11 on t11.form_number = x.form_number and t11.property_name = 'field.frm_authType'
;

回答1:

You should not use SELECT DISTINCT all the time. Keep in mind that DISTINCT is bound to be a no-op if you have any unique constraints in your select-list, so there's probably no need. If there are duplicates, DISTINCT is costly because it sorts the table so duplicates are arranged together to be de-duped.

You also shouldn't do lots of self-joins for this kind of data. Each of your subqueries in your self-join are reading the whole table.

SELECT form_number,
  MAX(form_datetime) AS reg_date,
  MAX(form_name) AS form_name,
  MAX(form_version) AS form_version,
  MAX(CASE property_name WHEN 'field.frm_personID' THEN property_value END) AS amg_id,
  MAX(CASE property_name WHEN 'field.frm_firstName' THEN property_value END) AS first_name,
  MAX(CASE property_name WHEN 'field.frm_lastName' THEN property_value END) AS last_name,
  MAX(CASE property_name WHEN 'field.frm_address' THEN property_value END) AS address,
  MAX(CASE property_name WHEN 'field.frm_address2' THEN property_value END) AS address_2,
  MAX(CASE property_name WHEN 'field.frm_city' THEN property_value END) AS city,
  MAX(CASE property_name WHEN 'field.frm_state' THEN property_value END) AS state_code,
  MAX(CASE property_name WHEN 'field.frm_zip' THEN property_value END) AS zip,
  MAX(CASE property_name WHEN 'field.frm_phone' THEN property_value END) AS phone,
  MAX(CASE property_name WHEN 'field.frm_emailAddress' THEN property_value END) AS email,
  MAX(CASE property_name WHEN 'field.frm_youAre' THEN property_value END) AS registrant_type,
  MAX(CASE property_name WHEN 'field.frm_authType' THEN property_value END) AS auth_type_code
FROM c_regs
GROUP BY form_number;

Explanation: The GROUP BY causes all rows for a given form_number to be treated as one group, and the result will have one row per group.

All other columns that are not named in the GROUP BY must be in grouping functions. I chose MAX(). I assume there should be only one distinct value per group for the form datetime, name, and version.

For the properties, we put an expression inside the MAX() function to return the value only on rows where the property has a certain value. On other rows, the expression is NULL, which MAX() will ignore.

In this way, you get the result you want without having to do any self-joins or DISTINCT modifiers either. The query scans through the table just once, and it should be must faster.

回答2:

BK's proposition that lots of self joins are harmful is misleading.

Consider an EAV data set comprising of 10,000 entities, each with 12 attributes as follows:

DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table
(entity INT NOT NULL
,attribute INT NOT NULL
,value INT NOT NULL
,PRIMARY KEY(entity,attribute)
);

INSERT INTO my_table VALUES
(1,101,RAND()*100),
(1,102,RAND()*100),
(1,103,RAND()*100),
(1,104,RAND()*100),
(1,105,RAND()*100),
(1,106,RAND()*100),
(1,107,RAND()*100),
(1,108,RAND()*100),
(1,109,RAND()*100),
(1,110,RAND()*100),
(1,111,RAND()*100),
(1,112,RAND()*100);

With this initial seed, I can use a table of integers (0-9) to rapidly populate the rest of the table...

INSERT IGNORE INTO my_table SELECT i4.i*1000+i3.i*100+i2.i*10+i1.i+1, attribute, RAND()*100 FROM my_table,ints i1, ints i2, ints i3, ints i4;

Bill's query...

SELECT SQL_NO_CACHE a.entity
     , MAX(CASE WHEN attribute = 101 THEN value END) x101
     , MAX(CASE WHEN attribute = 102 THEN value END) x102
     , MAX(CASE WHEN attribute = 103 THEN value END) x103
     , MAX(CASE WHEN attribute = 104 THEN value END) x104
     , MAX(CASE WHEN attribute = 105 THEN value END) x105
     , MAX(CASE WHEN attribute = 106 THEN value END) x106
     , MAX(CASE WHEN attribute = 107 THEN value END) x107
     , MAX(CASE WHEN attribute = 108 THEN value END) x108
     , MAX(CASE WHEN attribute = 109 THEN value END) x109
     , MAX(CASE WHEN attribute = 110 THEN value END) x110
     , MAX(CASE WHEN attribute = 111 THEN value END) x111
     , MAX(CASE WHEN attribute = 112 THEN value END) x112
  FROM my_table a
 GROUP 
    BY a.entity;

+--------+------+------+------+------+------+------+------+------+------+------+------+------+
| entity | x101 | x102 | x103 | x104 | x105 | x106 | x107 | x108 | x109 | x110 | x111 | x112 |
+--------+------+------+------+------+------+------+------+------+------+------+------+------+
|      1 |   78 |    8 |    4 |   95 |   66 |   43 |   16 |   51 |    9 |   89 |   20 |   33 |
...
|   9998 |   61 |   72 |   67 |   20 |   23 |   10 |   31 |   37 |   69 |   18 |   24 |   32 |
|   9999 |   67 |   91 |   32 |   58 |   77 |   81 |   61 |   22 |   75 |   65 |   91 |   42 |
|  10000 |   52 |   38 |   56 |   32 |   14 |   77 |   10 |   99 |   70 |   70 |   82 |   13 |
+--------+------+------+------+------+------+------+------+------+------+------+------+------+    
10000 rows in set (0.20 sec)

The alternative...

SELECT SQL_NO_CACHE a.entity
     , a.value x101
     , b.value x102
     , c.value x103
     , d.value x104
     , e.value x105
     , f.value x106
     , g.value x107
     , h.value x108
     , i.value x109
     , j.value x110
     , k.value x111
     , l.value x112
  FROM my_table a
  LEFT JOIN my_table b ON b.entity = a.entity  AND b.attribute = 102
  LEFT JOIN my_table c ON c.entity = a.entity  AND c.attribute = 103
  LEFT JOIN my_table d ON d.entity = a.entity  AND d.attribute = 104
  LEFT JOIN my_table e ON e.entity = a.entity  AND e.attribute = 105
  LEFT JOIN my_table f ON f.entity = a.entity  AND f.attribute = 106
  LEFT JOIN my_table g ON g.entity = a.entity  AND g.attribute = 107
  LEFT JOIN my_table h ON h.entity = a.entity  AND h.attribute = 108
  LEFT JOIN my_table i ON i.entity = a.entity  AND i.attribute = 109
  LEFT JOIN my_table j ON j.entity = a.entity  AND j.attribute = 110
  LEFT JOIN my_table k ON k.entity = a.entity  AND k.attribute = 111
  LEFT JOIN my_table l ON l.entity = a.entity  AND l.attribute = 112
  WHERE a.attribute = 101;

+--------+------+------+------+------+------+------+------+------+------+------+------+------+
| entity | x101 | x102 | x103 | x104 | x105 | x106 | x107 | x108 | x109 | x110 | x111 | x112 |
+--------+------+------+------+------+------+------+------+------+------+------+------+------+
|      1 |   78 |    8 |    4 |   95 |   66 |   43 |   16 |   51 |    9 |   89 |   20 |   33 |
...
|   9998 |   61 |   72 |   67 |   20 |   23 |   10 |   31 |   37 |   69 |   18 |   24 |   32 |
|   9999 |   67 |   91 |   32 |   58 |   77 |   81 |   61 |   22 |   75 |   65 |   91 |   42 |
|  10000 |   52 |   38 |   56 |   32 |   14 |   77 |   10 |   99 |   70 |   70 |   82 |   13 |
+--------+------+------+------+------+------+------+------+------+------+------+------+------+
10000 rows in set (0.23 sec)

So, Bill's query IS fractionally faster. However, as soon as you reduce the number of entities sought (while maintaining the same number of attributes - so same number of joins), the alternative query can overtake Bill's by something approaching the same kind of margin...

Bill's query with WHERE a.entity <= 5000 added

  |   4998 |   59 |   55 |   93 |   48 |   72 |   32 |   38 |   36 |    6 |   82 |   23 |   62 |
  |   4999 |   23 |   10 |   11 |   29 |   69 |   67 |   92 |   72 |   25 |   49 |   79 |   48 |
  |   5000 |   39 |   86 |   77 |    0 |   30 |   38 |   48 |   54 |    9 |   97 |   25 |   54 |
  +--------+------+------+------+------+------+------+------+------+------+------+------+------+
 5000 rows in set (0.12 sec)

The alternative with WHERE a.entity <= 5000 added

  |   4998 |   59 |   55 |   93 |   48 |   72 |   32 |   38 |   36 |    6 |   82 |   23 |   62 |
  |   4999 |   23 |   10 |   11 |   29 |   69 |   67 |   92 |   72 |   25 |   49 |   79 |   48 |
  |   5000 |   39 |   86 |   77 |    0 |   30 |   38 |   48 |   54 |    9 |   97 |   25 |   54 |
  +--------+------+------+------+------+------+------+------+------+------+------+------+------+
 5000 rows in set (0.11 sec)

So it isn't really the number of joins, but rather the assiduous use of indexes that makes the difference between a slow and a fast query.

回答3:

You don't need all those joins. With my optimization, the data will return in rows rather than columns as you have them.

(i did not run this, so test it first)

SELECT 
    ifnull(x.form_datetime,'') reg_date,
    ifnull(x.property_value,'') amg_id,
    x.form_number,
    x.form_name,
    x.form_version,
    x.property_name,
    x.property_value
FROM c_regs x
WHERE x.property_name IN (
    'field.frm_firstName',
    'field.frm_lastName',
    'field.frm_address',
    ...
)
AND x.form_number = 'the form id'
GROUP BY x.form_number, x.property_name
ORDER BY x.form_number ASC;

The AND is only necessary if you want a particular form, not all of them. (which I would suggest)

Also ask your self the question: do you need to have the field names in the condition? You can use my query as a sub-query and then merge each field as column like you had before, without another join.

回答4:

Try to add union clause at your code

    SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
LEFT JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID
UNION
SELECT ID, NAME, AMOUNT, DATE
FROM CUSTOMERS
RIGHT JOIN ORDERS
ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID;

回答5:

This is awful:

inner join (select distinct * from c_regs) as t7
       on t7.form_number = x.form_number and t7.property_name = 'field.frm_zip'

It scans the entire c_regs table, removing duplicate rows, and copying the de-dupped rows into a temp table with no indexes. Then it rummages in it for what might (or might not) be one row.

Note that the DISTINCT does not guarantee that at most one row will be returned. (I will ignore the multi-row issue.)

It would be much better to do

inner join c_regs AS t7 ON
        t7.form_number = x.form_number and t7.property_name = 'field.frm_zip'

But that also needs INDEX(form_number, property_name). Even better would be to have the PRIMARY KEY start with those two columns, as discussed here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#speeding_up_wp_postmeta

Meanwhile, there is no need to the extra layer of SELECT right after the first FROM.

Meanwhile, you should embark on getting rid of dups in c_regs, and preventing their return! A suitable natural PRIMARY KEY is likely to solve the problem. (Again, see my link.)

来源：https://stackoverflow.com/questions/41905375/optimize-mysql-self-join-query

标签

mysql

query-optimization

self-join