Storing time-series data, relational or non?

后端 未结 10 1860
栀梦
栀梦 2020-11-28 17:04

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNM

相关标签:
10条回答
  • 2020-11-28 17:49

    You table has data in single table. So relational vs non relational is not the question. Basically you need to read a lot of sequential data. Now if you have enough RAM to store a years worth data then nothing like using Redis/MongoDB etc.

    Mostly NoSQL databases will store your data on same location on disk and in compressed form to avoid multiple disk access.

    NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With database even if you do this the index and data may be at different places and there would be a lot of disk IO.

    Tools like Splunk are using NoSQL backends to store time series data and then using map reduce to create aggregates (which might be what you want later). So in my opinion to use NoSQL is an option as people have already tried it for similar use cases. But will a million rows bring the database to crawl (maybe not , with decent hardware and proper configurations).

    0 讨论(0)
  • 2020-11-28 17:52

    Definitely Relational. Unlimited flexibility and expansion.

    Two corrections, both in concept and application, followed by an elevation.

    Correction

    1. It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).

    2. Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:

         (Device, Metric, DateTime)
      
      • The Id column does nothing, it is totally and completely redundant.

        • An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
        • The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.

        • You can get rid of it. Please.

    Elevation

    1. Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.

      • (I have one index only, not three; on the Non-SQLs you may need three indices).

      • I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.

        (Server, Device, Metric, DateTime)

      The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.

      • Monitor Statistics Data Model.
        (Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)

      • It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)

      • Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.

    One More Thing

    Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.

    0 讨论(0)
  • 2020-11-28 17:56

    I think that the answer for this kind of question should mainly revolve about the way your Database utilize storage. Some Database servers use RAM and Disk, some use RAM only (optionally Disk for persistency), etc. Most common SQL Database solutions are using memory+disk storage and writes the data in a Row based layout (every inserted raw is written in the same physical location). For timeseries stores, in most cases the workload is something like: Relatively-low interval of massive amount of inserts, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric)

    I have found Columnar Databases (google it, you'll find MonetDB, InfoBright, parAccel, etc) are doing terrific job for time series.

    As for your question, which personally I think is somewhat invalid (as all discussions using the fault term NoSQL - IMO): You can use a Database server that can talk SQL on one hand, making your life very easy as everyone knows SQL for many years and this language has been perfected over and over again for data queries; but still utilize RAM, CPU Cache and Disk in a Columnar oriented way, making your solution best fit Time Series

    0 讨论(0)
  • 2020-11-28 17:59

    This is a problem we've had to solve at ApiAxle. We wrote up a blog post on how we did it using Redis. It hasn't been out there for very long but it's proving to be effective.

    I've also used RRDTool for another project which was excellent.

    0 讨论(0)
提交回复
热议问题