Saving in a file an array or DataFrame together with other information


误落风尘 2020-12-23 00:09

The statistical software Stata allows short text snippets to be saved within a dataset, using notes and/or characteristics.

This is a fea

6 answers

隐瞒了意图╮ (original poster)

2020-12-23 00:56

It's an interesting question, although very open-ended I think.

Text Snippets
For text snippets that are literal notes (that is, neither code nor data), I really don't know what your use case is, but I see no reason to deviate from the usual with open() as f: ... idiom.
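To make the plain-file option concrete, here is a minimal sketch. The helper names write_note and read_note, the file name and the note text are all made up for illustration.

```python
import tempfile
from pathlib import Path


def write_note(path, text):
    """Write a plain-text note to `path`, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_note(path):
    """Return the full contents of the note stored at `path`."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Hypothetical usage: keep the note in a file next to the data file.
note_path = Path(tempfile.mkdtemp()) / "notes.txt"
write_note(note_path, "Illustrative note: values are made up.\n")
loaded_note = read_note(note_path)
```

The obvious drawback is that the note and the data live in separate files, so they can drift apart.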

Small collections of various data pieces
Sure, your npz works. Actually what you are doing is very similar to creating a dictionary with everything you want to save and pickling that dictionary.
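As a rough side-by-side of the two approaches, the sketch below stores the same array and note once in an npz archive and once in a pickled dictionary. The variable names, file names and contents are assumptions, not taken from the question.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

workdir = Path(tempfile.mkdtemp())
values = np.arange(6).reshape(2, 3)           # made-up data
note = "illustrative note: rows are samples"  # made-up note

# Option 1: npz archive. The note is wrapped in a 0-d string array.
npz_path = workdir / "bundle.npz"
np.savez(npz_path, values=values, note=np.array(note))
with np.load(npz_path) as bundle:
    npz_values = bundle["values"]
    npz_note = bundle["note"].item()  # unwrap the 0-d array to a str

# Option 2: a plain dictionary, pickled as a whole.
pkl_path = workdir / "bundle.pkl"
with open(pkl_path, "wb") as f:
    pickle.dump({"values": values, "note": note}, f)
with open(pkl_path, "rb") as f:
    pkl_bundle = pickle.load(f)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.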

See here for a discussion of the differences between pickle and npz; the main one is that npz is optimized for NumPy arrays.

Personally, I'd say that if you are not storing NumPy arrays, I would use pickle, and even implement a quick MyNotes class that is basically a dictionary for saving things, plus any additional functionality you may want.
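One possible shape for such a class is sketched below. The save and load methods and the example keys are my own assumptions about what "additional functionality" could mean here.

```python
import pickle
import tempfile
from pathlib import Path


class MyNotes(dict):
    """Hypothetical dict subclass that can save itself to disk with pickle."""

    def save(self, path):
        # Pickle a plain dict so the file does not depend on this class.
        with open(path, "wb") as f:
            pickle.dump(dict(self), f)

    @classmethod
    def load(cls, path):
        # Only load files you trust: unpickling can run arbitrary code.
        with open(path, "rb") as f:
            return cls(pickle.load(f))


# Hypothetical usage with made-up entries.
notes = MyNotes(source="illustrative", units={"income": "USD"})
notes["comment"] = "made-up comment"
notes_path = Path(tempfile.mkdtemp()) / "notes.pkl"
notes.save(notes_path)
restored = MyNotes.load(notes_path)
```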

Collection of large objects
For really big np.arrays or DataFrames I have previously used the HDF5 format. The good thing is that it is already built into pandas, so you can call df.to_hdf() directly. It does require PyTables underneath (installation should be fairly painless with pip or conda), but using PyTables directly can be a much bigger pain.

Again, the idea is very similar: you are creating an HDFStore, which is pretty much a big dictionary in which you can store (almost any) objects. The benefit is that the format uses space more efficiently by exploiting repetition of similar values. When I used it to store some ~2GB dataframes, it reduced them by almost a full order of magnitude (to ~250MB).
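For the original question, an HDFStore can also carry extra information next to the DataFrame as an attribute on the stored node. The sketch below assumes PyTables is installed. The function names, the "data" key and the "notes" attribute name are arbitrary choices of mine, and compression is requested explicitly through complevel and complib.

```python
import pandas as pd


def save_frame_with_notes(df, path, notes, key="data"):
    """Store `df` under `key` and attach `notes` as a node attribute."""
    with pd.HDFStore(path, mode="w", complevel=9, complib="zlib") as store:
        store.put(key, df)
        # "notes" is an arbitrary attribute name chosen for this sketch.
        store.get_storer(key).attrs.notes = notes


def load_frame_with_notes(path, key="data"):
    """Return the DataFrame stored under `key` together with its notes."""
    with pd.HDFStore(path, mode="r") as store:
        df = store[key]
        notes = store.get_storer(key).attrs.notes
    return df, notes
```

You would call save_frame_with_notes(df, "data.h5", {"comment": "..."}) and later get both pieces back from load_frame_with_notes("data.h5").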

One last player: feather
Feather is a project created by Wes McKinney and Hadley Wickham on top of the Apache Arrow framework to persist data in a language-agnostic binary format (so you can read it from both R and Python). However, it is still under development, and last time I checked they did not encourage using it for long-term storage (since the specification may change in future versions), only for exchanging data between R and Python.

They both just launched Ursa Labs, literally just weeks ago, which will continue growing this and similar initiatives.
