Question
I am interested in monitoring some objects. I expect to get about 10000 data points every 15 minutes. (Maybe not at first, but this is the 'general ballpark'). I would also like to be able to get daily, weekly, monthly and yearly statistics. It is not critical to keep the data in the highest resolution (15 minutes) for more than two months.
I am considering various ways to store this data, and have been looking at a classic relational database, or at a schemaless database (such as SimpleDB).
My question is, what is the best way to go about doing this? I would very much prefer an open-source (and free) solution to a proprietary, costly one.
Small note: I am writing this application in Python.
Answer 1:
HDF5, which can be accessed through h5py or PyTables, is designed for dealing with very large data sets. Both interfaces work well. For example, both h5py and PyTables offer automatic compression and support NumPy.
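As a rough illustration, here is a minimal sketch using h5py; the file name, dataset name, and the 10000-column layout are just assumptions matching the numbers in the question:

import h5py
import numpy as np

# Open (or create) an HDF5 file with an extendable, gzip-compressed dataset:
# one row per 15-minute snapshot, one column per monitored object.
with h5py.File("metrics.h5", "a") as f:
    if "readings" not in f:
        dset = f.create_dataset(
            "readings",
            shape=(0, 10000),
            maxshape=(None, 10000),   # unlimited number of rows
            dtype="f4",
            compression="gzip",
        )
    else:
        dset = f["readings"]

    # Append the latest snapshot (a NumPy array of 10000 values).
    snapshot = np.random.rand(10000).astype("f4")  # placeholder data
    dset.resize(dset.shape[0] + 1, axis=0)
    dset[-1, :] = snapshot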
Answer 2:
RRDTool by Tobi Oetiker, definitely! It's open source, and it has been designed for exactly this kind of use case.
EDIT:
To provide a few highlights: RRDTool stores time-series data in a round-robin database. It keeps raw data for a given period of time, then condenses it in a configurable way, so you have, say, fine-grained data for a month, data averaged over a week for the last 6 months, and data averaged over a month for the last 2 years. As a side effect, your database stays the same size all of the time (so no sweating that your disk may run full). That was the storage side. On the retrieval side, RRDTool offers data queries that are immediately turned into graphs (e.g. PNG) that you can readily include in documents and web pages. It's a rock-solid, proven solution that is a much more generalized form of its predecessor, MRTG (some might have heard of this). And once you've gotten into it, you will find yourself re-using it over and over again.
For a quick overview and to see who uses RRDTool, see also here. If you want to see which kinds of graphics you can produce, make sure you have a look at the gallery.
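To make the retention scheme concrete, here is a minimal sketch using the python-rrdtool bindings; the step and RRA numbers are only assumptions chosen to match the 15-minute resolution and two-month retention from the question:

import rrdtool

# One RRD per metric: 15-minute step, raw values kept ~2 months,
# daily averages kept for ~2 years. The file size is fixed at creation.
rrdtool.create(
    "metric.rrd",
    "--step", "900",                # one primary data point every 15 minutes
    "DS:value:GAUGE:1800:U:U",      # tolerate up to 30 minutes between updates
    "RRA:AVERAGE:0.5:1:5760",       # 5760 x 15 min ~= 2 months of raw data
    "RRA:AVERAGE:0.5:96:730",       # 96 steps = 1 day, kept for ~2 years
)

# Push a value, then render a graph of the last week.
rrdtool.update("metric.rrd", "N:42.0")
rrdtool.graph(
    "metric.png",
    "--start", "-1w",
    "DEF:v=metric.rrd:value:AVERAGE",
    "LINE1:v#0000FF:value",
)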
Answer 3:
Plain text files? It's not clear what your 10k data points per 15 minutes translates to in terms of bytes, but in any case text files are easier to store/archive/transfer/manipulate, and you can inspect them directly, just by looking at them. They're fairly easy to work with from Python, too.
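For example (the file layout and field names are just an assumption), appending and re-reading such data takes only a few lines with Python's standard csv module:

import csv
import time

# Append one snapshot: timestamp, object id, value -- one row per data point.
with open("readings.csv", "a", newline="") as f:
    writer = csv.writer(f)
    now = int(time.time())
    for obj_id, value in [("sensor-1", 3.2), ("sensor-2", 7.9)]:  # placeholder data
        writer.writerow([now, obj_id, value])

# Read it back just as easily.
with open("readings.csv", newline="") as f:
    for ts, obj_id, value in csv.reader(f):
        print(ts, obj_id, float(value))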
Answer 4:
This is pretty standard data-warehousing stuff.
Lots of "facts", organized by a number of dimensions, one of which is time. Lots of aggregation.
In many cases, simple flat files that you process with simple aggregation algorithms based on defaultdict
will work wonders -- fast and simple.
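A minimal sketch of what such a defaultdict-based rollup could look like, assuming the simple timestamp/object/value rows mentioned above:

from collections import defaultdict
from datetime import datetime
import csv

# Roll 15-minute samples up into daily averages, grouped by object.
totals = defaultdict(lambda: [0.0, 0])   # (object_id, day) -> [sum, count]

with open("readings.csv", newline="") as f:
    for ts, obj_id, value in csv.reader(f):
        day = datetime.utcfromtimestamp(int(ts)).date()
        bucket = totals[(obj_id, day)]
        bucket[0] += float(value)
        bucket[1] += 1

for (obj_id, day), (total, count) in sorted(totals.items()):
    print(obj_id, day, total / count)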
Look at Efficiently storing 7.300.000.000 rows
Database choice for large data volume?
Answer 5:
There is an open-source time-series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time, maximum 10 items, starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
Console.WriteLine(val);
}
Source: https://stackoverflow.com/questions/1334813/what-is-the-best-open-source-solution-for-storing-time-series-data