Update column from another table in large mysql db (7 million rows)

一个人想着一个人 submitted on 2019-12-12 10:09:18

Question


Description

I have 2 tables with the following structure (irrelevant columns removed):

mysql> explain parts;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| code        | varchar(32)  | NO   | PRI | NULL    |       |
| slug        | varchar(255) | YES  |     | NULL    |       |
| title       | varchar(64)  | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

and

mysql> explain details;
+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| sku               | varchar(32)  | NO   | PRI | NULL    |       |
| description       | varchar(700) | YES  |     | NULL    |       |
| part_code         | varchar(32)  | NO   | PRI |         |       |
+-------------------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

Table parts contains 184147 rows, and details contains 7278870 rows. The part_code column in details references the code column in parts. Since both columns are varchar, I want to add an integer column id int(11) to parts and a matching column part_id int(11) to details. I tried this:

mysql> alter table parts drop primary key;
Query OK, 184147 rows affected (0.66 sec)
Records: 184147  Duplicates: 0  Warnings: 0

mysql> alter table parts add column
       id int(11) not null auto_increment primary key first;
Query OK, 184147 rows affected (0.55 sec)
Records: 184147  Duplicates: 0  Warnings: 0

mysql> select id, code from parts limit 5;
+----+-------------------------+
| id | code                    |
+----+-------------------------+
|  1 | Yhk0KqSMeLcfH1KEfykihQ2 |
|  2 | IMl4iweZdmrBGvSUCtMCJA2 |
|  3 | rAKZUDj1WOnbkX_8S8mNbw2 |
|  4 | rV09rJ3X33-MPiNRcPTAwA2 |
|  5 | LPyIa_M_TOZ8655u1Ls5mA2 |
+----+-------------------------+
5 rows in set (0.00 sec)

So now the parts table has an id column with correct data. Next I added the part_id column to the details table:

mysql> alter table details add column part_id int(11) not null after part_code;
Query OK, 7278870 rows affected (1 min 17.74 sec)
Records: 7278870  Duplicates: 0  Warnings: 0

Now the big problem is how to populate part_id accordingly. The following query:

mysql> update details d
       join parts p on d.part_code = p.code
       set d.part_id = p.id;

ran for about 30 hours before I killed it.

Note that both tables are MyISAM:

mysql> select engine from information_schema.tables where table_schema = 'db_name' and (table_name = 'parts' or table_name = 'details');
+--------+
| ENGINE |
+--------+
| MyISAM |
| MyISAM |
+--------+
2 rows in set (0.01 sec)

I have just realized that one of the problems is that by dropping the primary key on the parts table I also dropped the index on the code column. On the other hand, the details table has the following indexes (some irrelevant columns are omitted):

mysql> show indexes from details;
+---------+------------+----------+--------------+-------------+-----------+-------------+------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Index_type |
+---------+------------+----------+--------------+-------------+-----------+-------------+------------+
| details |          0 | PRIMARY  |            1 | sku         | A         |        NULL | BTREE      |
| details |          0 | PRIMARY  |            3 | part_code   | A         |     7278870 | BTREE      |
+---------+------------+----------+--------------+-------------+-----------+-------------+------------+
2 rows in set (0.00 sec)

My questions are:

  1. Is the update query OK, or can it be optimized somehow?
  2. If I add an index on the code column in the parts table, will the query run in a reasonable time, or will it run for days again?
  3. How can I write a (sql/bash/php) script that shows the progress of the query execution?

Thank you very much!


Answer 1:


As I mentioned in the question, I had forgotten about the index I dropped on the parts table, so I added it back:

alter table parts add key code (code);

Inspired by Puggan Se's answer, I tried to use LIMIT on the UPDATE in a PHP script, but MySQL does not allow LIMIT in a multi-table UPDATE (one with a JOIN). To restrict each query to a range of rows instead, I added a new column to the details table:

# drop the primary key,
alter table details drop primary key;
# so I can create an auto_increment column
alter table details add id int not null auto_increment primary key;
# alter the id column and remove the auto_increment
alter table details change id id int not null;
# drop again the primary key
alter table details drop primary key;
# add new indexes
alter table details add primary key ( id, sku, num, part_code );

Now I can restrict each update to a range of ids, which acts as the "limit":

update details d
join parts p on d.part_code = p.code
set d.part_id = p.id
where d.id between 1 and 5000;
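
The range in the WHERE clause can be derived from a batch number, so that consecutive batches tile the id space with no gaps or overlaps. What follows is a minimal sketch, assuming the ids are contiguous and start at 1 (which the auto_increment step above should produce). `batchRange` is a hypothetical helper name introduced here, not something from the original script:

```php
<?php
// Hypothetical helper: inclusive id range for a 0-based batch number.
// Assumes ids start at 1 and that $size ids are covered per batch.
function batchRange(int $batch, int $size = 5000): array
{
    $from = $batch * $size + 1;   // batch 0 starts at id 1
    $to   = $from + $size - 1;    // inclusive upper bound for BETWEEN
    return [$from, $to];
}

// Print the first three ranges to check that they tile without gaps.
for ($b = 0; $b < 3; $b++) {
    [$from, $to] = batchRange($b);
    echo "where d.id between $from and $to", PHP_EOL;
}
```

This prints the ranges 1 to 5000, 5001 to 10000 and 10001 to 15000. Starting the first range at 5000 rather than 1 would silently skip ids 1 to 4999.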

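So here's the full PHP script (it assumes a connection has already been opened with the legacy mysql extension):

$started = time();
$i = 0;
$total = 7278870; // number of rows in details
$batch = 5000;    // ids covered by each UPDATE

echo "Started at " . date('H:i:s', $started) . PHP_EOL;

// Format a number of seconds as HH:MM:SS.
function timef($s){
    $s = (int) $s;
    // floor, not round: round() would turn 50 minutes into "01" hour
    $h = str_pad(floor($s / 3600), 2, '0', STR_PAD_LEFT);
    $s = $s % 3600;
    $m = str_pad(floor($s / 60), 2, '0', STR_PAD_LEFT);
    $s = str_pad($s % 60, 2, '0', STR_PAD_LEFT);
    return "$h:$m:$s";
}

while (1){
    // batch $i covers ids $j..$k; the first batch has to start at id 1
    $j = $i * $batch + 1;
    $k = $j + $batch - 1;
    $i++;
    $result = mysql_query("
        update details d
        join parts p on d.part_code = p.code
        set d.part_id = p.id
        where d.id between $j and $k
    ");
    if(!$result) die(mysql_error());
    if(mysql_affected_rows() == 0) die(PHP_EOL . 'Done!');
    // percentage done, clamped because the last batch overshoots $total
    $p = min(100, round(($i * $batch) / $total, 4) * 100);
    $s = time() - $started;
    $ela = timef($s);
    $eta = timef( (( $s / $p ) * 100) - $s );
    $eq = floor($p/10);
    // draw the '>' head of the bar only while the job is unfinished
    $show_gt = ($p < 100);
    $spaces = $show_gt ? 9 - $eq : 10 - $eq;
    echo "\r {$p}% | [" . str_repeat('=', $eq) . ( $show_gt ? '>' : '' ) . str_repeat(' ', $spaces) . "] | Elapsed: {$ela} | ETA: {$eta}";
}

[Screenshot: progress output of the script]

As you can see, the whole thing took less than 5 minutes :) Thank you all!

P.S.: When I ran it, the loop incremented $i before computing the range, so the first batch started at id 5000 and ids 1 to 4999 were skipped. I later found those 4999 rows left with part_id = 0 and updated them manually. The script above starts at id 1 to avoid this.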




Answer 2:


  1. You may want to add a WHERE and a LIMIT, so you can update the table in chunks:

    update details d
    join parts p on d.part_code = p.code
    set d.part_id = p.id
    WHERE d.part_id =0
    LIMIT 5000;
    
  2. It will be a lot faster with an index, and if you run one query as suggested in 1 above, you can see how long 5000 rows take to process.

  3. Loop the above query:

    while(TRUE)
    {
        $result = mysql_query($query);
        if(!$result) die('Failed: ' . mysql_error());
        if(mysql_affected_rows() == 0) die('Done');
        echo '.';
    }
    

EDIT 1: rewrote the query because of the LIMIT error on joins.

You can use a subquery to avoid the multi-table update:

UPDATE details
SET part_id = (SELECT id FROM parts WHERE parts.code = details.part_code)
WHERE part_id = 0
LIMIT 5000;
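
The chunked query and the loop can be combined into a small driver that also reports progress. The following is only a sketch: `runInChunks` and the `$runChunk` callable are hypothetical names introduced here. The callable is assumed to execute one chunked UPDATE and return the number of rows it changed. The `$maxChunks` guard is an added assumption, there as a safety net against a loop that never reaches zero. The database is replaced by a counter so the block runs on its own:

```php
<?php
// Hypothetical driver: call $runChunk until it reports 0 affected rows.
// $runChunk is assumed to run one chunked UPDATE and return the number of
// rows it changed. Returns the total number of rows changed.
function runInChunks(callable $runChunk, int $total, int $maxChunks = 100000): int
{
    $done = 0;
    for ($n = 0; $n < $maxChunks; $n++) {
        $affected = $runChunk();
        if ($affected === 0) {
            break;                      // nothing left to update
        }
        $done += $affected;
        // Clamp at 100 in case $total is only an estimate.
        $pct = $total > 0 ? min(100, round($done / $total * 100, 2)) : 100;
        echo "\r{$done} rows updated ({$pct}%)";
    }
    echo PHP_EOL;
    return $done;
}

// Stand-in for the database: pretend 12000 rows still have part_id = 0
// and each call updates at most 5000 of them.
$remaining = 12000;
$fakeChunk = function () use (&$remaining): int {
    $n = min(5000, $remaining);
    $remaining -= $n;
    return $n;
};

$updated = runInChunks($fakeChunk, 12000);
```

With a real connection, the closure would execute the `UPDATE ... WHERE part_id = 0 LIMIT 5000` statement and return the affected-row count instead of decrementing a counter.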



Answer 3:


You can try removing the indexes from the table you're trying to update. MySQL has to maintain the affected indexes on each row update, which adds overhead. Even so, it won't be blazing fast for 7M records.



Source: https://stackoverflow.com/questions/11430362/update-column-from-another-table-in-large-mysql-db-7-million-rows
