Saving in a file an array or DataFrame together with other information

误落风尘 2020-12-23 00:09

The statistical software Stata allows short text snippets to be saved within a dataset. This is done using notes, characteristics, or both.

This is a fea

6 answers
  •  隐瞒了意图╮
    2020-12-23 00:56

    It's an interesting question, although very open-ended I think.

    Text Snippets
    For text snippets that are literal notes (that is, neither code nor data), I really don't know what your use case is, but I see no reason to deviate from the usual with open() as f: ...
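As a minimal sketch of the plain-text route, the helpers below write one note per line and read them back. The function names, the file name and the notes themselves are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path


def write_notes(path, notes):
    """Write each note on its own line in a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        for note in notes:
            f.write(note + "\n")


def read_notes(path):
    """Return the notes as a list of strings, one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical notes file that would sit next to the data file.
notes_path = Path(tempfile.mkdtemp()) / "survey_notes.txt"
write_notes(notes_path, ["Collected in 2019.", "Income is in thousands of USD."])
loaded_notes = read_notes(notes_path)
```

This keeps the notes human-readable, at the cost of having two files (data and notes) to keep together.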

    Small collections of various data pieces
    Sure, your npz works. Actually what you are doing is very similar to creating a dictionary with everything you want to save and pickling that dictionary.
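To make the comparison concrete, here is a rough sketch that saves the same bundle (one array plus one text note) both ways. The array, the note and the file names are made up for the example; the only library calls used are `np.savez`, `np.load`, `pickle.dump` and `pickle.load`.

```python
import os
import pickle
import tempfile

import numpy as np

values = np.arange(6, dtype=float).reshape(2, 3)
note = "Rows are subjects, columns are trials."  # hypothetical note

workdir = tempfile.mkdtemp()
npz_path = os.path.join(workdir, "bundle.npz")
pkl_path = os.path.join(workdir, "bundle.pkl")

# npz: every keyword becomes a named array; the string is stored as a 0-d array.
np.savez(npz_path, values=values, note=np.array(note))
with np.load(npz_path) as bundle:
    npz_values = bundle["values"]
    npz_note = bundle["note"].item()  # unwrap the 0-d array back into a str

# pickle: the same bundle as a plain dictionary.
with open(pkl_path, "wb") as f:
    pickle.dump({"values": values, "note": note}, f)
with open(pkl_path, "rb") as f:
    pkl_bundle = pickle.load(f)
```

Either way you end up with a name-to-object mapping; npz restricts the values to things NumPy can represent as arrays, while pickle takes arbitrary Python objects.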

    See here for a discussion of the differences between pickle and npz (but mainly, npz is optimized for numpy arrays).

    Personally, if you are not storing NumPy arrays, I would use pickle, and perhaps even implement a quick MyNotes class that is basically a dictionary for saving things, plus whatever additional functionality you may want.
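One possible shape for such a class is sketched below. Only the name MyNotes comes from the text above; the `add_note`, `save` and `load` methods and the `"_notes"` key are assumptions made for illustration.

```python
import os
import pickle
import tempfile


class MyNotes(dict):
    """A dictionary of arbitrary objects with pickle-based save and load."""

    def add_note(self, text):
        # Free-text notes are kept in a list under the reserved "_notes" key.
        self.setdefault("_notes", []).append(text)

    def save(self, path):
        # Pickle a plain dict so the file does not depend on this class.
        with open(path, "wb") as f:
            pickle.dump(dict(self), f)

    @classmethod
    def load(cls, path):
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with open(path, "rb") as f:
            return cls(pickle.load(f))


notes = MyNotes(scores=[0.91, 0.87], params={"alpha": 0.1})
notes.add_note("Second run used a smaller learning rate.")
notes_file = os.path.join(tempfile.mkdtemp(), "notes.pkl")
notes.save(notes_file)
restored = MyNotes.load(notes_file)
```

Saving a plain `dict` rather than the subclass instance means the file can still be opened with a bare `pickle.load` even if the class definition is not importable.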

    Collection of large objects
    For really big np.arrays or dataframes I have used the HDF5 format before. The good thing is that it is already built into pandas, so you can call df.to_hdf() directly. It does need pytables underneath (installation should be fairly painless with pip or conda), but using pytables directly can be a much bigger pain.

    Again, this idea is very similar: you are creating an HDFStore, which is pretty much a big dictionary in which you can store (almost any) objects. The benefit is that the format utilizes space in a smarter way by leveraging repetition of similar values. When I was using it to store some ~2GB dataframes, it was able to reduce it by almost a full order of magnitude (~250MB).
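A hedged sketch of the HDFStore idea follows: the DataFrame is stored under a key, and a small dictionary of notes is attached to that node's attributes. The helper names `save_with_notes` and `load_with_notes`, the default key and the `notes` attribute name are hypothetical; the sketch assumes pandas with pytables installed and uses `pd.HDFStore`, `store.put` and `store.get_storer(key).attrs`.

```python
import pandas as pd


def save_with_notes(df, notes, path, key="data"):
    """Store df under `key` and attach `notes` (a dict) to that node."""
    # complevel/complib turn on compression for everything in the store.
    with pd.HDFStore(path, mode="w", complevel=9, complib="zlib") as store:
        store.put(key, df)
        # Arbitrary picklable objects can be set as node attributes.
        store.get_storer(key).attrs.notes = notes


def load_with_notes(path, key="data"):
    """Return (df, notes) as written by save_with_notes."""
    with pd.HDFStore(path, mode="r") as store:
        return store[key], store.get_storer(key).attrs.notes
```

Usage would look like `save_with_notes(df, {"source": "2019 survey"}, "data.h5")` followed by `df, notes = load_with_notes("data.h5")`. Node attributes are meant for small metadata; anything large is better stored as its own key in the store.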

    One last player: feather
    Feather is a project created by Wes McKinney and Hadley Wickham on top of the Apache Arrow framework, to persist data in a binary format that is language agnostic (so you can read it from both R and Python). However, it is still under development, and the last time I checked they discouraged using it for long-term storage (since the specification may change in future versions) and recommended it only for exchanging data between R and Python.

    They both launched Ursa Labs just weeks ago, which will continue growing this and similar initiatives.
