Can anyone tell me how adding a key scales in MySQL? I have 500,000,000 rows in a table, trans, with columns i (INT UNSIGNED), j (INT UNSIGNED), nu (DOUBLE), A (DOUBLE). I tr
From my experience: if the hardware can cope with it, indexing large tables in MySQL usually scales pretty linearly. I have tried it with tables of about 100,000,000 rows so far, but not on a notebook - mainly on strong servers.
I guess it depends mainly on hardware factors, the table engine you're using (MyISAM, InnoDB, or whatever), and whether the table is otherwise in use in the meantime. When I was doing it, disk usage usually jumped sky-high, unlike CPU usage. I'm not sure about the hard disks of the MacBook, but I guess they aren't the fastest around.
If you're using MyISAM tables, maybe have a closer look at the index file in the table directory and see how it changes over time.
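If you'd rather watch the growth from inside MySQL than from the filesystem, something along these lines should work (the schema name originalDB is only a guess at your setup):
-- Index size as reported by the server
SHOW TABLE STATUS LIKE 'trans';
-- or, more selectively:
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.tables
WHERE table_schema = 'originalDB'
  AND table_name   = 'trans';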
Firstly, your table definition could make a big difference here. If you don't need NULL values in your columns, define them NOT NULL. This will save space in the index, and presumably time while creating it.
CREATE TABLE x (
i INTEGER UNSIGNED NOT NULL,
j INTEGER UNSIGNED NOT NULL,
nu DOUBLE NOT NULL,
A DOUBLE NOT NULL
);
As for the time taken to create the indexes, this requires a table scan and will show up as REPAIR BY SORTING. It should be quicker in your case (i.e. massive data set) to create a new table with the required indexes and insert the data into it, as this avoids the REPAIR BY SORTING operation; the indexes are built sequentially on the insert. There is a similar concept explained in this article.
CREATE DATABASE trans_clone;
CREATE TABLE trans_clone.trans LIKE originalDB.trans;
ALTER TABLE trans_clone.trans ADD KEY idx_A (A);
Then script the insert into chunks (as per the article; a rough sketch follows below), or dump the data using mysqldump:
mysqldump originalDB trans --extended-insert --skip-add-drop-table --no-create-db --no-create-info > originalDB.trans.sql
mysql trans_clone < originalDB.trans.sql
This will insert the data, but will not require an index rebuild (the index is built as each row is inserted) and should complete much faster.
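If you script the inserts in chunks instead, a minimal sketch could look like this; it assumes you copy the data in bounded ranges of i (the 50,000,000-wide slices are purely illustrative, and without an index on i each slice means a scan of the source table):
-- Repeat with successive ranges of i until the whole table is copied.
INSERT INTO trans_clone.trans (i, j, nu, A)
  SELECT i, j, nu, A
  FROM originalDB.trans
  WHERE i >= 0 AND i < 50000000;

INSERT INTO trans_clone.trans (i, j, nu, A)
  SELECT i, j, nu, A
  FROM originalDB.trans
  WHERE i >= 50000000 AND i < 100000000;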
There are a couple of factors to consider:
Sorting is roughly an N·log(N) operation. Since the big data set is about 35 times the size of the small one (500M rows versus 14M), the nominal sort time would be of the order of 50 times as long - under two hours.
However, you need 8 bytes per data value and about another 8 bytes of overhead (that's a guess - tune it to MySQL if you know more about what it stores in an index). So, 14M × 16 ≈ 220 MB of main memory, but 500M × 16 ≈ 8 GB. Unless your machine has that much memory to spare (and MySQL is configured to use it), the big sort spills to disk, and that accounts for a lot of the rest of the time.
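If you want to check whether MySQL is even allowed to use that much memory for the sort, the MyISAM repair buffers are the ones to look at (assuming MyISAM; InnoDB uses different settings, and the 2 GB figure below is only an example, not a recommendation):
SHOW VARIABLES LIKE 'myisam_sort_buffer_size';
SHOW VARIABLES LIKE 'myisam_max_sort_file_size';
-- Example only: raise the in-memory sort buffer for the session running the ALTER.
SET SESSION myisam_sort_buffer_size = 2 * 1024 * 1024 * 1024;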
So theoretically, if the sorting step is an N·log(N) operation, partitioning your big table would save time on the operation.
For a table of 500,000,000 rows partitioned into 100 equal pieces, that works out to roughly a 23 % gain: 500,000,000 × log(500,000,000) ≈ 4,349,485,002, whereas 100 × (5,000,000 × log(5,000,000)) ≈ 3,349,485,002 (base-10 logs).
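For what it's worth, a partitioned clone of the table could be declared roughly like this (MySQL 5.1+; HASH on i and 100 partitions are just illustrative choices, so each partition's index on A is built over about 5,000,000 rows):
CREATE TABLE trans_part (
  i  INTEGER UNSIGNED NOT NULL,
  j  INTEGER UNSIGNED NOT NULL,
  nu DOUBLE NOT NULL,
  A  DOUBLE NOT NULL,
  KEY idx_A (A)
)
PARTITION BY HASH (i)
PARTITIONS 100;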