Question
I am trying to serialize a large (~10**6 rows, each with ~20 values) list, to be used later by myself (so pickle's lack of safety isn't a concern).
Each row of the list is a tuple of values, derived from some SQL database. So far, I have seen datetime.datetime, strings, integers, and NoneType, but I might eventually have to support additional data types.
For serialization, I've considered pickle (cPickle), json, and plain text - but only pickle saves the type information: json can't serialize datetime.datetime, and plain text has its obvious disadvantages.
However, cPickle is pretty slow for data this large, and I'm looking for a faster alternative.
Answer 1:
I think you should give PyTables a look. It should be ridiculously fast, at least faster than using an RDBMS, since it's very lax and doesn't impose any read/write restrictions, plus you get a better interface for managing your data, at least compared to pickling it.
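A minimal sketch of what that might look like, assuming the rows are coerced into a fixed NumPy dtype (HDF5 has no native datetime or None, so those would need an encoding such as timestamps and sentinel values; all names here are illustrative):

import numpy as np
import tables  # pip install tables

# Describe one row with a fixed-width structured dtype.
row_dtype = np.dtype([("id", "i8"), ("name", "S32"), ("ts", "f8")])
rows = np.zeros(1000, dtype=row_dtype)  # stand-in for the real data

with tables.open_file("rows.h5", mode="w") as h5:
    table = h5.create_table("/", "rows", description=row_dtype)
    table.append(rows)

with tables.open_file("rows.h5", mode="r") as h5:
    restored = h5.root.rows.read()  # back as one structured array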
Answer 2:
Pickle is actually quite fast, so long as you aren't using the (default) ASCII protocol. Just make sure to dump using protocol=pickle.HIGHEST_PROTOCOL.
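A minimal sketch of that round trip (on Python 2 this would be cPickle; pickle.HIGHEST_PROTOCOL selects the fastest binary protocol available):

import pickle  # on Python 2: import cPickle as pickle

rows = [(1, "alice", None), (2, "bob", None)]  # stand-in data

with open("rows.pkl", "wb") as f:
    pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("rows.pkl", "rb") as f:
    restored = pickle.load(f)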
Answer 3:
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
Advantages over XML:
- are simpler
- are 3 to 10 times smaller
- are 20 to 100 times faster
- are less ambiguous
- generate data access classes that are easier to use programmatically
https://developers.google.com/protocol-buffers/docs/pythontutorial
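Protocol buffers do require a schema and a code-generation step. A hypothetical sketch, assuming a rows.proto has been compiled with protoc --python_out=. into a rows_pb2 module (the message and field names are made up for illustration):

# rows.proto, compiled separately:
#   syntax = "proto3";
#   message Row {
#     int64 id = 1;
#     string name = 2;
#     string created_at = 3;  // no native datetime type; store e.g. ISO-8601
#   }
#   message RowList { repeated Row rows = 1; }

import rows_pb2  # generated module; the name is an assumption

row_list = rows_pb2.RowList()
row = row_list.rows.add()
row.id = 1
row.name = "alice"
row.created_at = "2012-03-27T12:00:00"

data = row_list.SerializeToString()  # compact binary bytes

restored = rows_pb2.RowList()
restored.ParseFromString(data)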
Answer 4:
- Protocol Buffers: used in e.g. Caffe; maintains type information, but takes considerably more effort than pickle
- MessagePack: see the python package; supports streaming (source); a minimal sketch follows this list
- BSON: see the python package docs
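A minimal MessagePack sketch, including the streaming style (datetime values are not natively supported here and would need an encoding such as ISO strings or epoch timestamps):

import msgpack  # pip install msgpack

rows = [(1, "alice", None), (2, "bob", None)]  # stand-in data

# Whole-list round trip; note that tuples come back as lists.
data = msgpack.packb(rows, use_bin_type=True)
restored = msgpack.unpackb(data, raw=False)

# Streaming: write rows one at a time, read them back lazily.
packer = msgpack.Packer(use_bin_type=True)
with open("rows.msgpack", "wb") as f:
    for row in rows:
        f.write(packer.pack(row))

with open("rows.msgpack", "rb") as f:
    restored_stream = [row for row in msgpack.Unpacker(f, raw=False)]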
Answer 5:
For hundreds of thousands of Python objects of simple (up to JSON-compatible) complexity, I've found the best combination of simplicity, speed, and size by combining:
- py-ubjson
- gzip
It beats the pickle and cPickle options by orders of magnitude.
import gzip
import ubjson  # pip install py-ubjson

with gzip.open(filename, 'wb') as f:
    ubjson.dump(items, f)

with gzip.open(filename, 'rb') as f:
    items = ubjson.load(f)
Answer 6:
I usually serialize to plain text (*.csv) because I found it to be the fastest. The csv module works quite well. See http://docs.python.org/library/csv.html
If you have to deal with unicode for your strings, check out the UnicodeReader and UnicodeWriter examples at the end.
If you serialize for your own future use, I guess it would suffice to know that you have the same data type per csv column (e.g., strings are always in column 2).
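A minimal sketch (Python 3 shown, where the csv module handles unicode natively; every value reads back as a string, so types have to be restored per column):

import csv

rows = [(1, "alice", "2012-03-27 12:00:00"), (2, "bob", "")]  # stand-in data

with open("rows.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open("rows.csv", newline="") as f:
    # e.g. column 0 is always an int, column 2 always a datetime string
    restored = [tuple(row) for row in csv.reader(f)]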
Answer 7:
Avro seems to be a promising and properly designed, though not yet widely popular, solution.
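A minimal sketch using the fastavro package (an assumption; the standard avro package has a similar API). Avro stores the schema with the data, a union with null covers NoneType, and the timestamp-millis logical type covers datetime.datetime:

from datetime import datetime, timezone

import fastavro  # pip install fastavro

schema = {
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": ["null", "string"]},
        {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}

rows = [{"id": 1, "name": "alice",
         "ts": datetime(2012, 3, 27, tzinfo=timezone.utc)}]  # stand-in data

with open("rows.avro", "wb") as f:
    fastavro.writer(f, schema, rows)

with open("rows.avro", "rb") as f:
    restored = list(fastavro.reader(f))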
Answer 8:
Just for the sake of completeness: there is also the dill library, which extends pickle.
How to dill (pickle) to file?
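A minimal sketch of the same round trip with dill, whose interface mirrors pickle's (it additionally handles objects plain pickle cannot, such as lambdas):

import dill  # pip install dill

rows = [(1, "alice", None)]  # stand-in data

with open("rows.pkl", "wb") as f:
    dill.dump(rows, f)

with open("rows.pkl", "rb") as f:
    restored = dill.load(f)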
Source: https://stackoverflow.com/questions/9897345/pickle-alternatives