Storing massive ordered time series data in Bigtable derivatives

Submitted by 為{幸葍}努か on 2019-11-29 18:48:52
Gotys

I am not an expert yet, but I've been playing with Cassandra for a few days now, and I have some answers for you:

  1. Don't worry about the amount of data; it's irrelevant with systems like Cassandra, provided you have the $$$ for a large hardware cluster.

Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?

Cassandra is very useful when you know how to work with keys. It can sift through keys very quickly. So to search for MSFT between 11:00 am and 1:30 pm, you'd have to key your rows like this:

MSFT-timestamp, GOOG-timestamp, etc. Then you can tell Cassandra to find all keys in the range that starts at MSFT-now and ends at MSFT-now+1hour.
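To make the key-naming idea concrete, here is a minimal in-memory sketch in Python. It makes no Cassandra calls; it only models why "symbol-timestamp" keys turn a time window into a contiguous key range. The helper names `row_key` and `range_scan` and the fixed-width timestamp format are assumptions for illustration, not part of any Cassandra API.

```python
import bisect
from datetime import datetime, timedelta


def row_key(symbol: str, ts: datetime) -> str:
    # Fixed-width timestamps sort lexicographically in chronological order,
    # so all keys for one symbol and one time window are adjacent.
    return f"{symbol}-{ts.strftime('%Y%m%dT%H%M%S')}"


def range_scan(sorted_keys, start_key, end_key):
    # Return every key k with start_key <= k <= end_key.
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_right(sorted_keys, end_key)
    return sorted_keys[lo:hi]


# Hypothetical data: one key every 30 minutes for two symbols.
day = datetime(2019, 11, 29)
keys = sorted(
    row_key(sym, day + timedelta(hours=h, minutes=m))
    for sym in ("GOOG", "MSFT")
    for h in range(9, 16)
    for m in (0, 30)
)

hits = range_scan(
    keys,
    row_key("MSFT", day + timedelta(hours=11)),
    row_key("MSFT", day + timedelta(hours=13, minutes=30)),
)
```

The scan touches only the MSFT keys inside the window, which is the property the key naming is meant to buy you.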

What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

I am not an expert, but so far I have found that Cassandra doesn't search by values at all. So if you want to do the above, you will have to make another table dedicated just to this problem and design your schema to fit the case. But it won't be much different from what I described above. It's all about naming your keys and columns. Cassandra can find them very quickly!
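One way to picture such a dedicated table is as a second set of keys in which the value (the price) is part of the key. The sketch below is plain Python with no Cassandra involved; the class `PriceIndex`, the "day-price-symbol" key layout, and the zero-padded cents encoding are all assumptions chosen for illustration.

```python
import bisect
from decimal import Decimal


def index_key(day: str, price: Decimal, symbol: str) -> str:
    # Price is stored as zero-padded cents so that string order equals
    # numeric order within a day.
    cents = int(price * 100)
    return f"{day}-{cents:010d}-{symbol}"


class PriceIndex:
    """Sorted key set standing in for a second, value-keyed table."""

    def __init__(self):
        self._keys = []

    def record(self, day: str, price: Decimal, symbol: str) -> None:
        key = index_key(day, price, symbol)
        pos = bisect.bisect_left(self._keys, key)
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)

    def symbols_between(self, day: str, low: Decimal, high: Decimal):
        # The price range becomes a key range, just like the time range did.
        start = f"{day}-{int(low * 100):010d}-"
        end = f"{day}-{int(high * 100):010d}-\uffff"
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return sorted({k.rsplit("-", 1)[1] for k in self._keys[lo:hi]})


# Hypothetical ticks.
idx = PriceIndex()
idx.record("20191129", Decimal("10.10"), "AAA")
idx.record("20191129", Decimal("10.25"), "BBB")
idx.record("20191129", Decimal("10.30"), "CCC")
idx.record("20191128", Decimal("10.20"), "DDD")
idx.record("20191129", Decimal("9.99"), "EEE")

matches = idx.symbols_between("20191129", Decimal("10"), Decimal("10.25"))
```

The cost of this design is that your program has to write every tick twice, once per table, and keep the two in sync.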

What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?

Correct, all logic is done inside your program. This is not MySQL. This is just a storage engine. (But I am sure the next versions will offer this sort of thing.)
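As a rough illustration of doing that logic client-side, the Python sketch below subtracts two series after they have been fetched. The data, the dict-of-timestamps representation, and the choice to keep only timestamps present in both series are assumptions; your own alignment rule may differ.

```python
def subtract_series(a: dict, b: dict) -> dict:
    # Keep only timestamps that appear in both series, in sorted order,
    # and compute a - b at each of them.
    return {ts: a[ts] - b[ts] for ts in sorted(a.keys() & b.keys())}


# Hypothetical values, as they might look after two separate range reads.
msft = {"11:00": 150.0, "11:01": 150.5, "11:02": 151.0}
goog = {"11:00": 100.0, "11:02": 100.5, "11:03": 101.0}

spread = subtract_series(msft, goog)
# The caller now holds all three: msft, goog, and spread.
```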

Please remember that I am a novice at this; if I am wrong, feel free to correct me.

If you're dealing with a massive time series database, then the standards are:

These aren't cheap, but they can handle your data very efficiently.

Someone whom I respect recommended the Open Time Series Database. In particular, he said the schema was the nicest he had ever seen.

http://opentsdb.net/

I am standing in front of the same mountain. My main problem with Cassandra is that I cannot get a stream over the result set, for example in the form of an iterator.

I have already looked up and down the docs and the net, but found nothing.

I can't fetch all the keys and then get the rows, because having billions of rows makes this impossible.

The DataStax Java Driver supports automatic paging, so it streams the results just like an iterator, and it's all built in. This is in Cassandra 2.0.1, by the way: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
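For readers who want to see what "paging behind an iterator" means, here is a generic Python sketch of the pattern. It is not the DataStax driver's API: `iter_rows`, `fake_fetch_page`, and the "resume after the last key seen" protocol are hypothetical stand-ins, and the driver does the equivalent work for you internally.

```python
import bisect
from typing import Callable, Iterator, List, Optional


def iter_rows(
    fetch_page: Callable[[Optional[str], int], List[str]],
    page_size: int = 1000,
) -> Iterator[str]:
    # Pull one page at a time and hand rows out lazily, so the caller
    # never holds more than page_size rows in memory.
    last_key: Optional[str] = None
    while True:
        page = fetch_page(last_key, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # a short page means the result set is exhausted
        last_key = page[-1]


# Hypothetical backend: a sorted in-memory key list standing in for the store.
ALL_KEYS = [f"MSFT-{i:06d}" for i in range(2500)]


def fake_fetch_page(start_after: Optional[str], limit: int) -> List[str]:
    start = 0 if start_after is None else bisect.bisect_right(ALL_KEYS, start_after)
    return ALL_KEYS[start:start + limit]


rows = list(iter_rows(fake_fetch_page, page_size=1000))
```

In real use you would loop over `iter_rows(...)` directly instead of materializing a list, which is the whole point when there are billions of rows.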

Just for the sake of completeness, for anyone reading this in 2018: there is now a special database just for time series data called TimescaleDB.

http://www.timescale.com/

This blog post is worth reading. It explains why it's superior to solutions like Cassandra for that special case and why they decided to build it on top of the relational PostgreSQL database.

https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c
