Fetching datapoint in Cassandra based on statistics

Posted by 喜夏-厌秋 on 2019-12-22 08:24:09

Question


I'm testing out Cassandra (2.0) as a possible replacement for storing our time-series data.

I made a simple table and dumped some of our data into it:

CREATE TABLE DataRaw(
  channelId int,
  sampleTime timestamp,
  value double,
  PRIMARY KEY (channelId, sampleTime)
) WITH CLUSTERING ORDER BY (sampleTime ASC);

I can quite easily perform the most used queries like first value, last value (current) and get statistics via max, min, count, avg etc.

But I also need to fetch not only the max value in a range, but also the sampleTime at which that value occurs.

For the given data:

sampleTime          value
2015-08-28 00:00    10
2015-08-28 01:00    15
2015-08-28 02:00    13

I'd like the query to return 2015-08-28 01:00 and 15

I tried something like this:

select sampletime, value
from dataraw
where channelid=865
  and sampletime >= '2014-01-01 00:00'
  and sampleTime < '2014-01-02 00:00'
  and value = (select max(value)
               from dataraw
               where channelid=865
                 and sampletime >= '2014-01-01 00:00'
                 and sampleTime < '2014-01-02 00:00');

which would work in "normal" SQL, but Cassandra does not like it.

Any ideas?


Answer 1:


You can do this type of query in Cassandra 2.2. The older 2.0 branch is outdated and has fewer query options than 2.2.

In 2.2 it looks like this:

cqlsh:test> SELECT  * from dataraw ;

 channelid | sampletime               | value
-----------+--------------------------+-------
         1 | 2015-08-28 06:20:38-0400 |    10
         1 | 2015-08-28 06:20:49-0400 |    15
         1 | 2015-08-28 06:20:57-0400 |    13

cqlsh:test> SELECT sampletime, max(value) FROM dataraw 
            WHERE channelid=1 AND sampletime >= '2015-08-28 06:20:38-0400' 
                  AND sampletime <= '2015-08-28 06:20:57-0400';

 sampletime               | system.max(value)
--------------------------+-------------------
 2015-08-28 06:20:38-0400 |                15

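Note that the sampletime column in this result is the timestamp of the first row in the range (06:20:38), not of the row that holds the maximum (06:20:49), so the built-in max() on its own does not tell you when the maximum occurred.

For some more background: although CQL (Cassandra Query Language) looks similar to SQL, it has a lot of restrictions on what types of queries you can do. See this.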

So you have a few options:

  1. Set up your schema and queries to work within the restrictions of CQL, sometimes relying on code in your client to do filtering/analysis on a superset of the rows you are actually interested in.

  2. You can create UDFs (user-defined functions) and user-defined aggregate functions to do some additional work on the query coordinator (i.e. using in-cluster resources instead of client resources).

  3. You can pair Cassandra with Apache Spark, which can do much more complex analytics than CQL (but is somewhat complex to set up and use).

  4. In Cassandra 3.0 there is a new feature called materialized views, which lets you define an alternate primary key for your data to support query patterns that the base table would not support. Cassandra 3.0 was in beta release at the time of writing.
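As a rough illustration of option 1, the client can fetch the rows for the channel and time range with a plain range query and pick the maximum itself. The sketch below is plain Python over in-memory rows; the Sample type and the max_with_time helper are hypothetical names introduced here, and the step that actually reads rows from Cassandra is left out.

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class Sample(NamedTuple):
    # Hypothetical stand-in for one DataRaw row within a single channel.
    sample_time: datetime
    value: float


def max_with_time(rows: Iterable[Sample]) -> Optional[Sample]:
    """Return the row holding the largest value, or None if there are no rows.

    max() keeps the first maximal element it sees, so with rows in ascending
    sampleTime order (the table's clustering order) ties resolve to the
    earliest timestamp.
    """
    return max(rows, key=lambda row: row.value, default=None)


# The three rows from the question's example.
rows = [
    Sample(datetime(2015, 8, 28, 0, 0), 10.0),
    Sample(datetime(2015, 8, 28, 1, 0), 15.0),
    Sample(datetime(2015, 8, 28, 2, 0), 13.0),
]
peak = max_with_time(rows)
print(peak.sample_time, peak.value)  # 2015-08-28 01:00:00 15.0
```

This transfers every row in the range to the client, so it is only reasonable when the queried ranges are modest in size.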

Cassandra 2.2 adds the min, max, avg, and sum functions to CQL, along with user-defined functions, so it is more powerful than 2.0. I think over time CQL will slowly gain more SQL functionality, but some traditional SQL operations are difficult in a distributed model, and will take time to be implemented.
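A user-defined aggregate is built around a state function that is called once per row and carries an accumulated state forward. The sketch below models that idea in Python rather than as an actual Cassandra UDF; state_func and aggregate are hypothetical names, and the state is assumed to be a (sampletime, value) pair for the best row seen so far.

```python
from functools import reduce
from typing import Iterable, Optional, Tuple

# Aggregate state: (sample_time, value) of the best row so far, or None.
State = Optional[Tuple[str, float]]


def state_func(state: State, sample_time: str, value: float) -> State:
    """Called once per row: keep whichever row has the larger value.

    The strict '>' means an equal later value does not replace the state,
    so ties resolve to the first row that reached the maximum.
    """
    if state is None or value > state[1]:
        return (sample_time, value)
    return state


def aggregate(rows: Iterable[Tuple[str, float]]) -> State:
    # The initial state is None, meaning no row has been seen yet.
    return reduce(lambda st, row: state_func(st, *row), rows, None)


rows = [
    ("2015-08-28 00:00", 10.0),
    ("2015-08-28 01:00", 15.0),
    ("2015-08-28 02:00", 13.0),
]
print(aggregate(rows))  # ('2015-08-28 01:00', 15.0)
```

Because the state carries the timestamp together with the value, a single pass yields both the maximum and the time at which it occurred.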




Answer 2:


Axibase Time-Series Database supports MIN_VALUE_TIME and MAX_VALUE_TIME aggregators.

  • MIN_VALUE_TIME returns time in milliseconds when the MIN value was first reached within the period.
  • MAX_VALUE_TIME returns time in milliseconds when the MAX value was first reached within the period.

Multiple aggregators can be combined within the same API request so you can fetch both MAX and MAX_VALUE_TIME in one go.

As for the back-end, ATSD uses HBase for raw storage.

Disclosure: I work for Axibase.

UPDATE 1: Examples of how these aggregators can be represented. Typically you would show timestamps along with the MIN and MAX values respectively. This answers the question of what the maximum was and when it was reached.

UPDATE 2: SQL Console

SELECT entity,
  MAX(value),
  date_format(MAX_VALUE_TIME(value), 'yyyy-MM-dd HH:mm:ss') AS "Max Value Time"
FROM cpu_busy
WHERE time > current_hour
GROUP BY entity



Source: https://stackoverflow.com/questions/32266927/fetching-datapoint-in-cassandra-based-on-statistics
