Efficiently storing time series data: MySQL or flat files? Many tables (or files) or queries with WHERE condition?

花落未央 2021-01-31 00:16

What's the best way to store time series data from thousands (possibly soon millions) of real-world hardware sensors? The sensors themselves are different, some just capture one

1 answer
  •  梦如初夏
    2021-01-31 00:45

    To answer this question, we must first analyse the real issue you're facing.

    The real issue is finding the most efficient combination of writing and retrieving data.

    Let's review your conclusions:

    • thousands of tables - well, that defeats the purpose of a database and makes the data harder to work with. You also gain nothing: disk seeks are still involved, now with many file descriptors in use. You have to know the table names, and there are thousands of them. It is also difficult to extract data, which is what databases are for: structuring the data so that you can easily cross-reference records. Thousands of tables are inefficient from both a performance and a usability point of view. Bad choice.

    • a CSV file - it is probably excellent for fetching the data if you need the entire contents at once, but it is far from good for manipulating or transforming the data. Because you rely on a specific layout, you have to be extra careful when writing to CSV. If this grows to thousands of CSV files, you haven't done yourself a favor. You removed the overhead of SQL (which isn't that big) but did nothing for retrieving parts of the data set. You will also have problems fetching historic data or cross-referencing anything. Bad choice.

    The ideal scenario is being able to access any part of the data set quickly and efficiently, without any structural change.

    And this is exactly the reason why we use relational databases and why we dedicate entire servers with a lot of RAM to those databases.

    In your case, you are using MyISAM tables (.MYD file extension). It's an old storage format that worked well on the low-end hardware of its day. Today's machines are much faster and have far more memory, which is why we use InnoDB and let it use plenty of RAM to reduce I/O costs. The variable that controls this is innodb_buffer_pool_size; searching for it will produce meaningful results.
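
    As a rough sketch, this is how the setting can be inspected and changed from a MySQL client. The 4 GB figure is purely an illustrative assumption, and changing the value at runtime with SET GLOBAL requires MySQL 5.7.5 or later (older versions need the value set in the server configuration file, followed by a restart).

    ```sql
    -- Show the current buffer pool size, in bytes
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- Example only: resize the buffer pool to 4 GB (4294967296 bytes).
    -- The value is an assumption; pick one that fits your server's RAM.
    -- Online resizing needs MySQL 5.7.5+ and sufficient privileges.
    SET GLOBAL innodb_buffer_pool_size = 4294967296;
    ```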

    To answer the question: an efficient, satisfactory solution is to use one table for sensor information (id, title, description) and another for sensor readings. You allocate sufficient RAM or sufficiently fast storage (an SSD). The tables would look like this:

    CREATE TABLE sensors (
        id int unsigned not null auto_increment,
        sensor_title varchar(255) not null,
        description varchar(255) not null,
        date_created datetime,
        PRIMARY KEY (id)
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;

    CREATE TABLE sensor_readings (
        id int unsigned not null auto_increment,
        sensor_id int unsigned not null,
        date_created datetime,
        reading_value varchar(255), -- note: adjust this type to your data; I do not know what data type you need to hold the value(s)
        PRIMARY KEY (id),
        -- composite index so per-sensor time-range queries avoid a full scan (tune it to your actual queries)
        KEY idx_sensor_time (sensor_id, date_created),
        FOREIGN KEY (sensor_id) REFERENCES sensors (id) ON DELETE CASCADE
    ) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;
    

    InnoDB keeps each table in a single tablespace file (or, with innodb_file_per_table turned off, everything in one shared system tablespace), so a two-table design stays far below the file descriptor limit of the OS / filesystem. Several million, or even tens of millions, of records should not be a problem if you allocate 5-6 GB of RAM to hold the working data set in memory; that gives you quick access to the data.
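
    To judge how much RAM the working set might need, one option is to ask MySQL for the approximate size of each table. This is only a sketch: it assumes you are connected to the schema that holds the sensor tables, and for InnoDB the reported sizes are estimates.

    ```sql
    -- Approximate data + index size of each table in the current schema, in MB.
    -- For InnoDB these figures are estimates, not exact byte counts.
    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
    FROM information_schema.tables
    WHERE table_schema = DATABASE()
    ORDER BY size_mb DESC;
    ```

    If the total is comfortably below the buffer pool size, most reads can be served from memory.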

    If I were designing such a system, this is the first approach I would try (personally). From there, it's easy to adjust depending on what you need to do with the data.
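
    To make the design concrete, here is a minimal sketch of writing to and reading from the two tables. The sensor title, the reading value, the sensor id 1 and the date range are all made up for illustration.

    ```sql
    -- Register a sensor (title and description are placeholder values)
    INSERT INTO sensors (sensor_title, description, date_created)
    VALUES ('boiler-room-temp', 'Temperature probe, boiler room', NOW());

    -- Store a reading for the sensor that was just created
    INSERT INTO sensor_readings (sensor_id, date_created, reading_value)
    VALUES (LAST_INSERT_ID(), NOW(), '21.5');

    -- Fetch one sensor's readings for January 2021, oldest first
    -- (sensor_id = 1 assumes the sensor above received id 1)
    SELECT date_created, reading_value
    FROM sensor_readings
    WHERE sensor_id = 1
      AND date_created >= '2021-01-01'
      AND date_created <  '2021-02-01'
    ORDER BY date_created;

    -- Cross-reference: number of readings per sensor, including sensors with none
    SELECT s.id, s.sensor_title, COUNT(r.id) AS reading_count
    FROM sensors AS s
    LEFT JOIN sensor_readings AS r ON r.sensor_id = s.id
    GROUP BY s.id, s.sensor_title;
    ```

    Time-range queries like the third statement are fastest when an index covers (sensor_id, date_created), so that MySQL does not have to scan every reading.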
