MySQL: Splitting a large table into partitions or separate tables?


Question


I have a MySQL database with over 20 tables, but one of them is significantly large because it collects measurement data from different sensors. Its size is around 145 GB on disk and it contains over 1 billion records. All of this data is also replicated to another MySQL server.

I'd like to split the data into smaller "shards", so my question is which of the solutions below would be better. I'd use the record's "timestamp" to divide the data by year. Almost all SELECT queries executed on this table include the "timestamp" field in the WHERE clause.

So these are the two options I cannot decide between:

  1. Using MySQL partitioning and dividing the data by year (e.g. partition1 - 2010, partition2 - 2011, etc.)
  2. Creating separate tables and dividing the data by year (e.g. measuring_2010, measuring_2011, etc.)

Are there any other (newer) possible options that I'm not aware of?

I know that in the first case MySQL itself would fetch the data from the 'shards', whereas in the second case I'd have to write a kind of wrapper for it and do it myself. Is there any other way, for the second case, to make all the separate tables appear as 'one big table' to fetch data from?

I know this question has been asked in the past, but maybe somebody has come up with a new solution (that I'm not aware of), or the best-practice solution has changed by now. :)

Thanks a lot for your help.

Edit:

The schema is something similar to this:

device_id (INT)
timestamp (DATETIME)
sensor_1_temp (FLOAT)
sensor_2_temp (FLOAT)
etc. (30 more for instance)

All sensor temperatures are written at the same moment, once a minute. Note that around 30 different sensor measurements are written in each row. This data is mostly used for displaying graphs and for some other statistical purposes.


Answer 1:


Well, if you are hoping for a new answer, that means you have probably read my answers, and I sound like a broken record. See my Partitioning blog for the few use cases where partitioning can help performance. Yours does not sound like any of the four cases.

Shrink device_id. INT is 4 bytes; do you really have millions of devices? TINYINT UNSIGNED is 1 byte and a range of 0..255. SMALLINT UNSIGNED is 2 bytes and a range of 0..64K. That will shrink the table a little.
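
As a rough sketch (the raw table's name is never given in the question, so 'measuring' below is an assumption), the shrink is a single column change; keep in mind that changing a column's type rebuilds the table, which will take a while at 145 GB:

-- Hypothetical table name; shrink device_id from 4 bytes to 2.
-- Saves 2 bytes per row across ~1 billion rows.
ALTER TABLE measuring
    MODIFY COLUMN device_id SMALLINT UNSIGNED NOT NULL;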

If your real question is about how to manage so much data, then let's "think outside the box". Read on.

Graphing... What date ranges are you graphing?

  • The 'last' hour/day/week/month/year?
  • An arbitrary hour/day/week/month/year?
  • An arbitrary range, not tied to day/week/month/year boundaries?

What are you graphing?

  • Average value over a day?
  • Max/min over a day?
  • Candlesticks (etc) for day or week or whatever?

Regardless of the case, you should build (and incrementally maintain) a Summary Table. A row would contain summary info for one hour. I would suggest

CREATE TABLE Summary (
    device_id SMALLINT UNSIGNED NOT NULL,
    sensor_id TINYINT UNSIGNED NOT NULL,
    hr TIMESTAMP NOT NULL,      -- start of the hour being summarized
    avg_val FLOAT NOT NULL,     -- AVG of the ~60 readings in that hour
    min_val FLOAT NOT NULL,
    max_val FLOAT NOT NULL,
    PRIMARY KEY (device_id, sensor_id, hr)
) ENGINE=InnoDB;

The one Summary table might be 9GB (for the current amount of data).

SELECT hr,
       avg_val,
       min_val,
       max_val
    FROM Summary
    WHERE device_id = ?
      AND sensor_id = ?
      AND hr >= ?
      AND hr  < ? + INTERVAL 20 DAY;

That query would give you the hi/lo/avg values for 480 hours (20 days); enough to graph? Grabbing 480 rows from the summary table is a lot faster than grabbing 60*480 rows from the raw data table.

Getting similar data for a year would probably choke a graphing package, so it may be worth building a summary of the summary -- with resolution of a day. It would be about 0.4GB.
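
A minimal sketch of that second-level rollup (the table name SummaryDaily is my own, not from any blog):

-- One row per device/sensor/day, built from the hourly Summary table.
CREATE TABLE SummaryDaily (
    device_id SMALLINT UNSIGNED NOT NULL,
    sensor_id TINYINT UNSIGNED NOT NULL,
    dy DATE NOT NULL,
    avg_val FLOAT NOT NULL,
    min_val FLOAT NOT NULL,
    max_val FLOAT NOT NULL,
    PRIMARY KEY (device_id, sensor_id, dy)
) ENGINE=InnoDB;

-- Nightly: roll yesterday's 24 hourly rows up into 1 daily row.
INSERT INTO SummaryDaily
    SELECT device_id, sensor_id, DATE(hr),
           AVG(avg_val), MIN(min_val), MAX(max_val)
        FROM Summary
        WHERE hr >= CURDATE() - INTERVAL 1 DAY
          AND hr <  CURDATE()
        GROUP BY device_id, sensor_id, DATE(hr);

(Averaging the hourly averages is exact here because every hour has the same number of readings -- 60; with uneven hours you would carry a count and a sum instead.)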

There are a few different ways to build the Summary table(s); we can discuss that after you have pondered its beauty and read my Summary tables blog. It may be that gathering one hour's worth of data, then augmenting the Summary table, is the best way, as sketched below. That would be somewhat like the flip-flop discussed in my Staging table blog.
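
One possible shape of that hourly augmentation step, assuming the raw table is called 'measuring' and that sensor_1_temp is mapped to sensor_id = 1 (the question defines no such mapping, so both are illustrative):

-- Summarize one just-finished hour for one sensor column.
-- Repeat (or UNION ALL) for each of the ~30 sensor columns,
-- each with its own sensor_id.
INSERT INTO Summary (device_id, sensor_id, hr, avg_val, min_val, max_val)
    SELECT device_id,
           1,       -- illustrative sensor_id for the sensor_1_temp column
           ?,       -- the hour being summarized
           AVG(sensor_1_temp), MIN(sensor_1_temp), MAX(sensor_1_temp)
        FROM measuring
        WHERE `timestamp` >= ?
          AND `timestamp` <  ? + INTERVAL 1 HOUR
        GROUP BY device_id;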

And, if you had the hourly summaries, do you really need the minute-by-minute data? Consider throwing it away. Or maybe throw away only the data older than, say, one month. That leads to using partitioning, but only for its benefit in deleting old data, as discussed in "Case 1" of my Partitioning blog. That is, you would have daily partitions, using DROP PARTITION and REORGANIZE PARTITION every night to shift the time window of the "Fact" table. This would shrink your 145GB footprint without losing much data. New footprint: about 12GB (the hourly summary plus the last 30 days' minute-by-minute detail).
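
What that nightly rotation could look like, sketched with illustrative partition names and the assumed table name 'measuring' (remember that MySQL requires the partitioning column to appear in every unique key, including the primary key):

-- One partition per day, plus a catch-all for the future.
ALTER TABLE measuring
    PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
        PARTITION p20210101 VALUES LESS THAN (TO_DAYS('2021-01-02')),
        PARTITION p20210102 VALUES LESS THAN (TO_DAYS('2021-01-03')),
        -- ...one partition per day, about 30 in all...
        PARTITION pFuture VALUES LESS THAN (MAXVALUE)
    );

-- Nightly: drop the oldest day (near-instant, unlike a huge DELETE)
-- and carve tomorrow's partition out of the catch-all.
ALTER TABLE measuring DROP PARTITION p20210101;
ALTER TABLE measuring REORGANIZE PARTITION pFuture INTO (
    PARTITION p20210202 VALUES LESS THAN (TO_DAYS('2021-02-03')),
    PARTITION pFuture VALUES LESS THAN (MAXVALUE)
);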

PS: The Summary Table blog shows how to get standard deviation.
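
The trick there is to carry the count and the sum of squares in each summary row; a hedged sketch (the extra columns ct, sum_val, sum_sq are additions of mine to the Summary table above):

-- With ct = COUNT(*), sum_val = SUM(x), sum_sq = SUM(x*x) per hour,
-- the population standard deviation over any range of hours is:
SELECT SQRT(SUM(sum_sq)/SUM(ct) - POW(SUM(sum_val)/SUM(ct), 2)) AS std_dev
    FROM Summary
    WHERE device_id = ?
      AND sensor_id = ?
      AND hr >= ?
      AND hr <  ?;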




Answer 2:


You haven't said much about how you use/query the data or what the schema looks like, but I'll try to make something up.

  1. One way to split your table is by entity (different sensors are different entities). That's useful if different sensors require different columns, so you don't have to force them all into one schema that fits every sensor (a least-common-multiple schema). It's not a good fit if sensors are added or removed dynamically, though, since you would have to add tables at runtime.
  2. Another approach is to split the table based on time. This is the case if, after some time, data can be "historized": it is no longer used by the actual business logic, only for statistical purposes.

Both approaches can also be combined. Furthermore, be sure that the table is properly indexed according to your query needs.

I strongly discourage any approach that regularly requires adding new tables over time, or anything similar. As always, I wouldn't split anything before there's an actual performance issue.

Edit:
To be clear, I would restructure the table to the following and not split it at all:

device_id (INT)
timestamp (DATETIME)
sensor_id (INT)   -- may or may not be unique by itself; if it is not,
                  -- make a composite key from device_id and sensor_id,
                  -- given that you need both for your queries
sensor_temp (FLOAT)
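
A sketch of that layout as DDL (the table name and the composite primary key are illustrative choices, not prescriptions):

-- "Tall" layout: one row per device/sensor/timestamp instead of
-- ~30 sensor columns per row. More rows, but the schema no longer
-- changes when sensors are added or removed.
CREATE TABLE measurement (
    device_id INT NOT NULL,
    sensor_id INT NOT NULL,
    `timestamp` DATETIME NOT NULL,
    sensor_temp FLOAT NOT NULL,
    PRIMARY KEY (device_id, sensor_id, `timestamp`)
) ENGINE=InnoDB;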

If the data grows fast and you expect to generate terabytes of data soon, you may be better off with a NoSQL approach. But that's a different story.



Source: https://stackoverflow.com/questions/46317100/mysql-splitting-a-large-table-into-partitions-or-separate-tables
