问题
I'm batching CSV
15GB (30mio rows) into a mysql-8
database.
Problem: the task takes about 20min, with approxy throughput of 15-20 MB/s. While the harddrive is capable of transfering files with 150 MB/s.
I have a RAM disk of 20GB, which holds my csv. Import as follows:
mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv
This uses LOAD DATA
under the hood.
My target table does not have any indexes, but approx 100 columns (I cannot change this).
What is strange: I tried tweaking several config parameters as follows in /etc/mysql/my.cnf
, but they did not give any significant improvement:
log_bin=OFF
skip-log-bin
innodb_buffer_pool_size=20G
tmp_table_size=20G
max_heap_table_size=20G
innodb_log_buffer_size=4M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
Question: does LOAD DATA
/ mysqlimport
respect those config changes? Or does it bypass? Or did I use the correct configuration file at all?
At least a select on the variables shows they are correctly loaded by the mysql server. For example show variables like 'innodb_doublewrite'
shows OFF
Anyways, how could I improve import speed further? Or is my database the bottleneck and there is no way to overcome the 15-20 MB/s threshold?
Update: Interestingly if I import my csv from harddrive into the ramdisk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same amount of rows, but only with a few (5) columns. And there I'm getting to about 80 MB/s. So clearly the number of columns is the bottleneck? But why do more columns slow down this process?
回答1:
MySQL/MariaDB engine have little parallelization when making bulk inserts. It can only use one CPU core per LOAD DATA
statement. You may probably monitor CPU utilization during load to see one core is fully utilized and it can provide only so much of output data - thus leaving disk throughput underutilized.
The most recent version of MySQL has new parallel load feature: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-parallel-table.html . It looks promising but probably hasn't received much feedback yet. I'm not sure it would help in your case.
I saw various checklists on the internet that recommended having higher values in the following config params: log_buffer_size
, log_file_size
, write_io_threads
, bulk_insert_buffer_size
. But the benefits were not very pronounced when I performed comparison tests (maybe 10-20% faster than just innodb_buffer_pool_size
being large enough).
回答2:
This could be normal. Let's walk through what is being done:
- The csv file is being read from a RAM disk, so no IOPs are being used.
- Are you using InnoDB? If so, the data is going into the buffer_pool. As blocks are being built there, they are being marked 'dirty' for eventual flushing to disk.
- Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
- After all the data is read, and the table is finished, the dirty blocks will gradually be flushed to disk.
- If you had non-unique indexes, they would similarly be written in a delayed manner to disk (cf 'Change buffering'). The change_buffer, by default occupies 25% of the buffer_pool.
How large is the resulting table? It may be significantly larger, or even smaller, than the 15GB of the csv file.
How much time did it take to bring the csv file into the ram disk? I proffer that that was wasted time and it should have been read from disk while doing the LOAD DATA
; that I/O can be overlapped.
Please SHOW GLOBAL VARIABLES LIKE 'innodb%';
; there are several others that may be relevant.
More
These are terrible:
tmp_table_size=20G
max_heap_table_size=20G
If you have a complex query, 20GB could be allocated in RAM, possibly multiple times!. Keep those to under 1% of RAM.
If copying the csv from hard disk to ram disk runs slowly, I would suspect the validity of 150 MB/s.
If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:
CREATE TABLE t LIKE real_table;
LOAD DATA INFILE INTO t ...; -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table; -- atomic; fast
DROP TABLE old;
来源:https://stackoverflow.com/questions/58305653/improve-performance-of-mysql-load-data-mysqlimport