Question
I'm testing out Cassandra (2.0) as a possible replacement for storing our time-series data.
I made a simple table and dumped some of our data into it:
CREATE TABLE DataRaw (
    channelId int,
    sampleTime timestamp,
    value double,
    PRIMARY KEY (channelId, sampleTime)
) WITH CLUSTERING ORDER BY (sampleTime ASC);
I can quite easily perform the most used queries like first value, last value (current) and get statistics via max, min, count, avg etc.
But I also need to fetch not only the max value in a range, but also the sampletime at which that value occurs.
For the given data:
sampleTime       | value
-----------------+-------
2015-08-28 00:00 | 10
2015-08-28 01:00 | 15
2015-08-28 02:00 | 13
I'd like the query to return 2015-08-28 01:00 and 15
I tried something like this:
select sampletime, value from dataraw
where channelid=865
  and sampletime >= '2014-01-01 00:00' and sampleTime < '2014-01-02 00:00'
  and value = (select max(value) from dataraw
               where channelid=865
                 and sampletime >= '2014-01-01 00:00' and sampleTime < '2014-01-02 00:00');
which would work in "normal" SQL, but Cassandra does not like it.
Any ideas?
Answer 1:
You can do this type of query in Cassandra 2.2. The older 2.0 branch is outdated and has fewer query options than 2.2.
In 2.2 it looks like this:
cqlsh:test> SELECT * from dataraw ;
channelid | sampletime | value
-----------+--------------------------+-------
1 | 2015-08-28 06:20:38-0400 | 10
1 | 2015-08-28 06:20:49-0400 | 15
1 | 2015-08-28 06:20:57-0400 | 13
cqlsh:test> SELECT sampletime, max(value) FROM dataraw
WHERE channelid=1 AND sampletime >= '2015-08-28 06:20:38-0400'
AND sampletime <= '2015-08-28 06:20:57-0400';
sampletime | system.max(value)
--------------------------+-------------------
2015-08-28 06:20:38-0400 | 15
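Note that the sampletime shown in that result (06:20:38) belongs to the first row in the range, not to the row holding the maximum (06:20:49). When a plain column is selected next to an aggregate, Cassandra returns that column's value from the first matching row. So the query gives you the max value, but finding when it occurred still takes one of the approaches listed further down.
For some more background, although CQL (Cassandra Query Language) looks similar to SQL, it has a lot of restrictions on what types of queries you can do.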
So you have a few options:
- Set up your schema and queries to work within the restrictions of CQL, sometimes relying on code in your client to do filtering/analysis on a superset of the rows you are actually interested in.
- Create UDFs (user-defined functions) and user-defined aggregate functions to do some additional work on the query coordinator (i.e. using in-cluster resources instead of client resources).
- Pair Cassandra with Apache Spark, which can do much more complex analytics than CQL (but is somewhat complex to set up and use).
- In Cassandra 3.0 there is a new feature called materialized views, which lets you define an alternate primary key for your data to support query patterns that the base table would not support. Cassandra 3.0 is currently in beta release.
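As a rough illustration of the client-side option, here is a minimal Python sketch that picks out the row holding the maximum. It assumes the rows have already been fetched with a plain range query (something like `SELECT sampletime, value FROM dataraw WHERE channelid = ? AND sampletime >= ? AND sampletime < ?`) and arrive as `(sampletime, value)` pairs. The `max_with_time` helper and the row shape are hypothetical, not part of any driver API.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Hypothetical row shape: (sampletime, value), as returned by a range query.
Row = Tuple[datetime, float]


def max_with_time(rows: Iterable[Row]) -> Optional[Row]:
    """Return the (sampletime, value) pair holding the largest value.

    Only a strictly greater value replaces the current best, so with rows
    in clustering order (sampletime ASC) the earliest occurrence of the
    maximum wins. Returns None when there are no rows.
    """
    best: Optional[Row] = None
    for sampletime, value in rows:
        if best is None or value > best[1]:
            best = (sampletime, value)
    return best


# Sample data from the question.
rows = [
    (datetime(2015, 8, 28, 0, 0), 10.0),
    (datetime(2015, 8, 28, 1, 0), 15.0),
    (datetime(2015, 8, 28, 2, 0), 13.0),
]
result = max_with_time(rows)
print(result)
```

Because the function consumes any iterable in a single pass, it should also work on a paged result set without loading the whole range into memory. The cost is that every row in the range travels to the client.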
Cassandra 2.2 adds the min, max, avg, and sum functions to CQL, along with user-defined functions, so it is more powerful than 2.0. I think over time CQL will slowly gain more SQL functionality, but some traditional SQL operations are difficult in a distributed model and will take time to be implemented.
Answer 2:
Axibase Time-Series Database supports MIN_VALUE_TIME and MAX_VALUE_TIME aggregators.
- MIN_VALUE_TIME returns time in milliseconds when the MIN value was first reached within the period.
- MAX_VALUE_TIME returns time in milliseconds when the MAX value was first reached within the period.
Multiple aggregators can be combined within the same API request so you can fetch both MAX and MAX_VALUE_TIME in one go.
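To make the described semantics concrete, here is a small Python sketch that computes the minimum, the maximum, and the time each was first reached in one pass over a period. It only mirrors the behaviour described above and is not ATSD's implementation. The `Extremes` class, the `extremes` function, and the `(epoch_millis, value)` sample shape are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Extremes:
    min_value: float
    min_value_time: int  # epoch milliseconds when the min was first reached
    max_value: float
    max_value_time: int  # epoch milliseconds when the max was first reached


def extremes(samples: Iterable[Tuple[int, float]]) -> Extremes:
    """Scan (epoch_millis, value) samples, given in time order, once.

    Strict comparisons mean a later sample that only equals the current
    extreme does not move its timestamp, so the first occurrence is kept.
    """
    it = iter(samples)
    try:
        t0, v0 = next(it)
    except StopIteration:
        raise ValueError("no samples in period") from None
    result = Extremes(v0, t0, v0, t0)
    for t, v in it:
        if v < result.min_value:
            result.min_value, result.min_value_time = v, t
        if v > result.max_value:
            result.max_value, result.max_value_time = v, t
    return result


# Hypothetical samples: the max (15.0) and the min (10.0) each occur twice.
samples = [(1000, 10.0), (2000, 15.0), (3000, 13.0), (4000, 15.0), (5000, 10.0)]
summary = extremes(samples)
print(summary)
```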
As for the back-end, ATSD uses HBase for raw storage.
Disclosure: I work for Axibase.
UPDATE 1: Examples of how these aggregators can be represented. Typically you would show timestamps alongside the MIN and MAX values respectively. This answers the question of what the maximum was and when it was reached.
UPDATE 2: SQL Console
SELECT entity,
       MAX(value),
       date_format(MAX_VALUE_TIME(value), 'yyyy-MM-dd HH:mm:ss') AS "Max Value Time"
FROM cpu_busy
WHERE time > current_hour
GROUP BY entity
Source: https://stackoverflow.com/questions/32266927/fetching-datapoint-in-cassandra-based-on-statistics