Saving in a file an array or DataFrame together with other information

Asked by 误落风尘 on 2020-12-23 00:09

The statistical software Stata allows short text snippets to be saved within a dataset, using notes and/or characteristics.

This is a feature I find very useful, as it allows me to save a variety of information, ranging from reminders and to-do lists, to information about how I generated the data, or even what the estimation method for a particular variable was.

Is there something similar in Python for saving a NumPy array or pandas DataFrame to a file together with such notes?

6 Answers
  • 2020-12-23 00:51

    A practical way could be to embed metadata directly inside the NumPy array. The advantage is that, as you'd like, there is no extra dependency, and it's very simple to use in code. However, this doesn't fully answer your question, because you still need a mechanism to save the data, and for that I'd recommend jpp's solution using HDF5.

    To include metadata in an ndarray, there is an example in the documentation: you basically subclass ndarray and add a field such as info or metadata.

    That gives (code from the linked documentation):

    import numpy as np
    
    class ArrayWithInfo(np.ndarray):
    
        def __new__(cls, input_array, info=None):
            # Input array is an already formed ndarray instance
            # We first cast to be our class type
            obj = np.asarray(input_array).view(cls)
            # add the new attribute to the created instance
            obj.info = info
            # Finally, we must return the newly created object:
            return obj
    
        def __array_finalize__(self, obj):
            # see InfoArray.__array_finalize__ for comments
            if obj is None: return
            self.info = getattr(obj, 'info', None)
    

    To save the data through NumPy, you'd need to overload the write machinery or persist the metadata separately, since np.save keeps only the raw array.
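
    A minimal sketch of the second route: store the raw array with np.save and the metadata alongside it (file names are illustrative; note that plain pickling of an ndarray subclass also drops extra attributes unless you override __reduce__ and __setstate__):

    import pickle
    import numpy as np

    arr = ArrayWithInfo(np.arange(10), info='generated 2020-12-23')

    # np.save keeps only the raw numbers, so store the metadata separately
    np.save('data.npy', np.asarray(arr))
    with open('data_info.pkl', 'wb') as fh:
        pickle.dump(arr.info, fh)

    # reload and reattach
    restored = ArrayWithInfo(np.load('data.npy'))
    with open('data_info.pkl', 'rb') as fh:
        restored.info = pickle.load(fh)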

  • 2020-12-23 00:56

    It's an interesting question, although a very open-ended one, I think.

    Text Snippets
    For literal text notes (as in, not code and not data), I don't know your exact use case, but I don't see why I would deviate from the usual with open() as f: ...
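
    For example (the file name is illustrative):

    with open('notes.txt', 'w') as f:
        f.write('TODO: check the outliers in the 2017 data\n')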

    Small collections of various data pieces
    Sure, your npz approach works. What you are doing is very similar to creating a dictionary with everything you want to save and pickling that dictionary.

    See here for a discussion of the differences between pickle and npz (but mainly, npz is optimized for numpy arrays).
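
    A quick sketch of that npz pattern (all names are illustrative):

    import numpy as np

    arr = np.random.normal(size=(10, 3))
    np.savez('bundle.npz', data=arr, notes=np.array(['estimated with OLS']))

    with np.load('bundle.npz') as bundle:
        arr_back = bundle['data']
        notes = bundle['notes']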

    Personally, if you are not storing NumPy arrays, I would use pickle, and even implement a quick MyNotes class that is basically a dictionary to save stuff in, with whatever additional functionality you want (a sketch follows).
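
    A minimal sketch of such a MyNotes class (the name comes from the suggestion above; the implementation is illustrative):

    import pickle

    class MyNotes(dict):
        """A dictionary of notes with pickle-based save/load helpers."""

        def save(self, path):
            with open(path, 'wb') as fh:
                pickle.dump(dict(self), fh)

        @classmethod
        def load(cls, path):
            with open(path, 'rb') as fh:
                return cls(pickle.load(fh))

    notes = MyNotes(todo='rerun the model', method='OLS with robust SEs')
    notes.save('notes.pkl')
    notes = MyNotes.load('notes.pkl')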

    Collection of large objects
    For really big np.arrays or DataFrames I have used the HDF5 format before. The good thing is that it is already built into pandas: you can directly call df.to_hdf(). It does need pytables underneath (installation should be fairly painless with pip or conda), but using pytables directly can be a much bigger pain.

    Again, the idea is very similar: you are creating an HDFStore, which is pretty much a big dictionary in which you can store (almost any) objects. The benefit is that the format uses space in a smarter way by leveraging repetition of similar values. When I used it to store some ~2GB DataFrames, it reduced them by almost a full order of magnitude (to ~250MB).
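
    A minimal sketch of that usage (file and key names are illustrative; requires pytables):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(100, 3)), columns=list('abc'))

    # an HDFStore behaves like a persistent, compressed dictionary
    with pd.HDFStore('store.h5', complevel=9, complib='zlib') as store:
        store['df'] = df
        store['notes'] = pd.Series(['estimated with OLS', 'check the outliers'])

    with pd.HDFStore('store.h5') as store:
        df_back = store['df']
        notes = store['notes']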

    One last player: feather
    Feather is a project created by Wes McKinney and Hadley Wickham on top of the Apache Arrow framework to persist data in a binary, language-agnostic format (so you can read it from both R and Python). However, it is still under development, and last time I checked they discouraged using it for long-term storage (since the specification may change in future versions), recommending it instead for communication between R and Python.
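
    The round trip in pandas is one call each way (a minimal sketch; requires the pyarrow or feather-format package, and the file name is illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(100, 3)), columns=list('abc'))
    df.to_feather('df.feather')
    df_back = pd.read_feather('df.feather')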

    The two of them launched Ursa Labs just weeks ago (at the time of writing), which will continue growing this and similar initiatives.

  • 2020-12-23 00:56

    You stated as the reasons for this question:

    ... it allows me to save a variety of information, ranging from reminders and to-do lists, to information about how I generated the data, or even what the estimation method for a particular variable was.

    May I suggest a different paradigm than that offered by Stata? Notes and characteristics seem very limited and confined to text only. Instead, you could use Jupyter Notebook for your research and data-analysis projects. It provides a rich environment to document your workflow and to capture details, thoughts, and ideas as you do your analysis and research. It is easy to share, and it's presentation-ready.

    Here is a gallery of interesting Jupyter Notebooks across many industries and disciplines to showcase the many features and use cases of notebooks. It may expand your horizons beyond trying to devise a way to tag simple snippets of text to your data.

  • 2020-12-23 00:56

    jpp's answer is pretty comprehensive; I just wanted to mention that as of pandas 0.22, parquet is a very convenient and fast option with almost no drawbacks versus csv (except perhaps the coffee break).

    See pandas.read_parquet and DataFrame.to_parquet.

    At the time of writing, you'll also need to

    pip install pyarrow
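
    With that installed, the pandas-level round trip is one call each way (a minimal sketch; the file name is illustrative, and note that parquet requires string column names):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(1000, 10)),
                      columns=[f'c{i}' for i in range(10)])
    df.to_parquet('df.parq')
    df_back = pd.read_parquet('df.parq')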
    

    In terms of adding information, parquet lets you attach metadata to the data at the schema level:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(np.random.normal(size=(1000, 10)))
    
    tab = pa.Table.from_pandas(df)
    
    tab = tab.replace_schema_metadata({'here' : 'it is'})
    
    pq.write_table(tab, 'where_is_it.parq')
    
    pq.read_table('where_is_it.parq')

    which then yields a table with the metadata attached:

    Pyarrow table
    0: double
    1: double
    2: double
    3: double
    4: double
    5: double
    6: double
    7: double
    8: double
    9: double
    __index_level_0__: int64
    metadata
    --------
    {b'here': b'it is'}
    

    To get this back to pandas:

    tab.to_pandas()
    
  • 2020-12-23 00:57

    There are many options. I will discuss only HDF5, because I have experience using this format.

    Advantages: portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.

    Disadvantages: reliance on a single low-level C API, possibility that corruption of the single file destroys all data, and deleting data does not automatically reduce file size.

    In my experience, for performance and portability, avoid pyTables / HDFStore to store numeric data. You can instead use the intuitive interface provided by h5py.

    Store an array

    import h5py, numpy as np
    
    arr = np.random.randint(0, 10, (1000, 1000))
    
    f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance
    
    dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100),
                            compression='gzip', compression_opts=9)
    

    Compression & chunking

    There are many compression choices, e.g. blosc and lzf are good choices for compression and decompression performance respectively. Note gzip is native; other compression filters may not ship by default with your HDF5 installation.

    Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
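
    For example, with the 100x100 chunks chosen above, a chunk-aligned slice touches exactly one chunk on disk:

    block = dset[:100, :100]  # decompresses and reads a single chunk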

    Add some attributes

    dset.attrs['Description'] = 'Some text snippet'
    dset.attrs['RowIndexArray'] = np.arange(1000)
    

    Store a dictionary

    d = {'my_key': np.arange(1000)}  # example: any mapping of names to arrays
    for k, v in d.items():
        f.create_dataset('dictgroup/' + str(k), data=v)
    

    Out-of-memory access

    dictionary = f['dictgroup']
    res = dictionary['my_key']  # a dataset handle; values are read from disk only when sliced
    

    There is no substitute for reading the h5py documentation, which exposes most of the C API, but you should see from the above there is a significant amount of flexibility.

  • 2020-12-23 00:58

    I agree with jpp that HDF5 storage is a good option here. The difference between his solution and mine is that mine uses pandas DataFrames instead of NumPy arrays. I prefer the DataFrame since it allows mixed types, multi-level indexing (even datetime indexing, which is VERY important for my work), and column labeling, which helps me remember how different datasets are organized. Pandas also provides a slew of built-in functionality (much like NumPy), including a built-in HDF writer, pandas.DataFrame.to_hdf, which I find convenient.

    When storing the DataFrame to h5, you have the option of storing a dictionary of metadata as well, which can be your notes to self, or actual metadata that does not need to be stored in the DataFrame (I use this for setting flags as well, e.g. {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}). In this regard, there is no difference between using a NumPy array and a DataFrame. For the full solution see my original question and solution here; a sketch of the pattern follows.
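
    A minimal sketch of that pattern via HDFStore (file and key names are illustrative; requires pytables):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(100, 3)), columns=list('abc'))

    with pd.HDFStore('data.h5') as store:
        store.put('df', df)
        # stash arbitrary metadata on the stored object's attributes
        store.get_storer('df').attrs.metadata = {'is_agl': True, 'scale_factor': 100}

    with pd.HDFStore('data.h5') as store:
        df_back = store['df']
        metadata = store.get_storer('df').attrs.metadata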
