Best of breed indexing data structures for Extremely Large time-series

Asked 2021-01-30 14:46 by 别那么骄傲

I'd like to ask fellow SO'ers for their opinions regarding best-of-breed data structures to be used for indexing time-series (a.k.a. column-wise data, a.k.a. flat linear data).

In short: on the order of 10^10 (timestamp, value) points sit on secondary storage, and the queries of interest are (1) locating points by time with an index that cannot fit entirely in RAM, and (2)/(3) range queries combining a time interval [t0, t1] with a value interval [v0, v1].

3 Answers
  • 2021-01-30 14:57

    It is going to be really time-consuming and complicated to implement this yourself. I recommend you use Cassandra. Cassandra gives you horizontal scalability and redundancy, and lets you run complicated map-reduce jobs later on. To learn how to store time series in Cassandra, take a look at http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra and http://www.youtube.com/watch?v=OzBJrQZjge0. A sketch of the bucketed schema from that article follows below.
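    Here is a minimal sketch of the bucketed time-series pattern from the DataStax article, using the Python cassandra-driver package. The keyspace/table names, the daily bucket size, and the localhost contact point are illustrative assumptions, not from the original answer.

    ```python
    from datetime import datetime, timezone

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])  # assumed: a Cassandra node on localhost
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS metrics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # One partition per (series, day) keeps partitions bounded in size while
    # clustering rows by timestamp, so a time-range scan within a bucket is
    # a single sequential read.
    session.execute("""
        CREATE TABLE IF NOT EXISTS metrics.ts_points (
            series_id text,
            day       text,
            ts        timestamp,
            value     double,
            PRIMARY KEY ((series_id, day), ts)
        )
    """)

    now = datetime.now(timezone.utc)
    session.execute(
        "INSERT INTO metrics.ts_points (series_id, day, ts, value) "
        "VALUES (%s, %s, %s, %s)",
        ("sensor-42", now.strftime("%Y-%m-%d"), now, 21.7),
    )
    ```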

  • 2021-01-30 15:03

    General ideas:

    Problem 1 is fairly common: create an index that fits into your RAM and holds links to the data on secondary storage (data structure: the B-tree family). Problems 2 and 3 are quite complicated because your data set is so large. You could partition your data into time ranges and precompute the min/max for each range. Using that information, you can filter out whole time ranges (e.g. if the max value for a range is 50 and you search for v0 > 60, the entire interval is out). The ranges that survive still have to be searched by scanning the data, so the effectiveness depends greatly on how fast the data changes.

    You can also build multiple index levels by combining the time ranges of lower levels, to make the filtering faster; a sketch of the single-level version follows below.
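    Here is a minimal sketch of the min/max pruning described above, in Python. The block size and the in-memory list standing in for secondary storage are illustrative assumptions; a second level of summaries over groups of blocks would extend this the same way.

    ```python
    BLOCK = 4  # points per summary block; real systems would use far larger blocks

    def build_summary(values):
        """One (min, max) pair per block, computable in a single streaming pass."""
        return [(min(values[i:i + BLOCK]), max(values[i:i + BLOCK]))
                for i in range(0, len(values), BLOCK)]

    def value_range_query(values, summary, v0, v1):
        """Indices i with v0 <= values[i] <= v1; blocks whose [min, max]
        cannot intersect [v0, v1] are skipped without touching the data."""
        hits = []
        for b, (lo, hi) in enumerate(summary):
            if hi < v0 or lo > v1:
                continue  # the whole block is filtered out
            for i in range(b * BLOCK, min((b + 1) * BLOCK, len(values))):
                if v0 <= values[i] <= v1:
                    hits.append(i)
        return hits

    data = [50, 48, 47, 49, 62, 61, 65, 64, 12, 11, 13, 10]
    summary = build_summary(data)
    print(value_range_query(data, summary, 60, 70))  # -> [4, 5, 6, 7]; only the middle block is scanned
    ```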

  • 2021-01-30 15:12

    You would probably want to use some type of large, balanced tree. As Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about fast insertions and updates, there is a lot of newer work coming out of places like MIT and CMU on cache-oblivious B-trees. For some discussion of how these are implemented, look up Tokutek's TokuDB; they have a number of good presentations, such as the following:

    http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf

    Questions 2 and 3 are in general much harder, since they involve higher-dimensional range searching. The standard data structure for this is the range tree, which gives O(log^{d-1}(n) + k) query time (with fractional cascading) at the cost of O(n log^{d-1}(n)) storage, where k is the number of reported points. You generally would not want to use a k-d tree for something like this. While it is true that k-d trees have optimal O(n) storage, with only O(n) storage you cannot evaluate range queries any faster than O(n^{(d-1)/d}). For d = 2 that is O(sqrt(n)) time, and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?).

    Fortunately, it sounds like in your situation you don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. What you could do is just use a one-dimensional range query to pull the interval of points in [t0, t1], then go through them as a post-process and apply the v constraints pointwise. This is the first thing I would try (after getting a good database implementation in place), and if it works, you are done! It only makes sense to optimize the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty, +infty] is orders of magnitude larger than the number of points in [t0, t1] x [v0, v1]. A sketch of this plan follows below.
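    Here is a minimal sketch of that plan, in Python. `bisect` over a sorted in-memory list stands in for the B-tree range lookup on disk; the function name and the data are illustrative.

    ```python
    from bisect import bisect_left, bisect_right

    def range_query(times, values, t0, t1, v0, v1):
        """All (t, v) with t0 <= t <= t1 and v0 <= v <= v1.
        `times` must be sorted, with one value per time coordinate."""
        lo = bisect_left(times, t0)    # O(log n): locate the time interval
        hi = bisect_right(times, t1)
        return [(t, v)                 # pointwise post-filter on the values
                for t, v in zip(times[lo:hi], values[lo:hi])
                if v0 <= v <= v1]

    times  = [1, 2, 3, 5, 8, 13, 21, 34]
    values = [9.0, 7.5, 8.1, 3.2, 4.4, 9.9, 2.0, 5.5]
    print(range_query(times, values, 2, 20, 3.0, 9.0))
    # -> [(2, 7.5), (3, 8.1), (5, 3.2), (8, 4.4)]
    ```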
