Saving in a file an array or DataFrame together with other information

Asked by 误落风尘 on 2020-12-23 00:09

The statistical software Stata allows short text snippets to be saved within a dataset, using notes and/or characteristics.

This is a feature I find very useful, as it allows me to save a variety of information, ranging from reminders and to-do lists, to information about how I generated the data, or even what the estimation method for a particular variable was.

Is there something similar in Python for saving a NumPy array or pandas DataFrame to a file together with such notes?

6 Answers
  • 2020-12-23 00:51

    A practical way could be to embed metadata directly inside the NumPy array. The advantage is that, as you'd like, there is no extra dependency, and it's very simple to use in code. However, this doesn't fully answer your question, because you still need a mechanism to save the data, and for that I'd recommend jpp's solution using HDF5.

    To include metadata in an ndarray, there is an example in the documentation: you basically subclass ndarray and add a field such as info or metadata.

    That gives (code from the linked documentation):

    import numpy as np
    
    class ArrayWithInfo(np.ndarray):
    
        def __new__(cls, input_array, info=None):
            # Input array is an already formed ndarray instance
            # We first cast to be our class type
            obj = np.asarray(input_array).view(cls)
            # add the new attribute to the created instance
            obj.info = info
            # Finally, we must return the newly created object:
            return obj
    
        def __array_finalize__(self, obj):
            # see InfoArray.__array_finalize__ for comments
            if obj is None: return
            self.info = getattr(obj, 'info', None)
    

    To save the data through NumPy, you'd need to overload the write machinery or persist the metadata separately, since np.save keeps only the raw array.
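
    A minimal sketch of the second route: store the raw array with np.save and the metadata alongside it (file names are illustrative; note that plain pickling of an ndarray subclass also drops extra attributes unless you override __reduce__ and __setstate__):

    import pickle
    import numpy as np

    arr = ArrayWithInfo(np.arange(10), info='generated 2020-12-23')

    # np.save keeps only the raw numbers, so store the metadata separately
    np.save('data.npy', np.asarray(arr))
    with open('data_info.pkl', 'wb') as fh:
        pickle.dump(arr.info, fh)

    # reload and reattach
    restored = ArrayWithInfo(np.load('data.npy'))
    with open('data_info.pkl', 'rb') as fh:
        restored.info = pickle.load(fh)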

  • 2020-12-23 00:56

    It's an interesting question, although a very open-ended one, I think.

    Text Snippets
    For literal text notes (as in, not code and not data), I don't know your exact use case, but I don't see why I would deviate from the usual with open() as f: ...
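
    For example (the file name is illustrative):

    with open('notes.txt', 'w') as f:
        f.write('TODO: check the outliers in the 2017 data\n')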

    Small collections of various data pieces
    Sure, your npz approach works. What you are doing is very similar to creating a dictionary with everything you want to save and pickling that dictionary.

    See here for a discussion of the differences between pickle and npz (but mainly, npz is optimized for numpy arrays).
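
    A quick sketch of that npz pattern (all names are illustrative):

    import numpy as np

    arr = np.random.normal(size=(10, 3))
    np.savez('bundle.npz', data=arr, notes=np.array(['estimated with OLS']))

    with np.load('bundle.npz') as bundle:
        arr_back = bundle['data']
        notes = bundle['notes']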

    Personally, if you are not storing NumPy arrays, I would use pickle, and even implement a quick MyNotes class that is basically a dictionary to save stuff in, with whatever additional functionality you want (a sketch follows).
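
    A minimal sketch of such a MyNotes class (the name comes from the suggestion above; the implementation is illustrative):

    import pickle

    class MyNotes(dict):
        """A dictionary of notes with pickle-based save/load helpers."""

        def save(self, path):
            with open(path, 'wb') as fh:
                pickle.dump(dict(self), fh)

        @classmethod
        def load(cls, path):
            with open(path, 'rb') as fh:
                return cls(pickle.load(fh))

    notes = MyNotes(todo='rerun the model', method='OLS with robust SEs')
    notes.save('notes.pkl')
    notes = MyNotes.load('notes.pkl')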

    Collection of large objects
    For really big np.arrays or DataFrames I have used the HDF5 format before. The good thing is that it is already built into pandas: you can directly call df.to_hdf(). It does need pytables underneath (installation should be fairly painless with pip or conda), but using pytables directly can be a much bigger pain.

    Again, the idea is very similar: you are creating an HDFStore, which is pretty much a big dictionary in which you can store (almost any) objects. The benefit is that the format uses space in a smarter way by leveraging repetition of similar values. When I used it to store some ~2GB DataFrames, it reduced them by almost a full order of magnitude (to ~250MB).
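
    A minimal sketch of that usage (file and key names are illustrative; requires pytables):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(100, 3)), columns=list('abc'))

    # an HDFStore behaves like a persistent, compressed dictionary
    with pd.HDFStore('store.h5', complevel=9, complib='zlib') as store:
        store['df'] = df
        store['notes'] = pd.Series(['estimated with OLS', 'check the outliers'])

    with pd.HDFStore('store.h5') as store:
        df_back = store['df']
        notes = store['notes']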

    One last player: feather
    Feather is a project created by Wes McKinney and Hadley Wickham on top of the Apache Arrow framework to persist data in a binary, language-agnostic format (so you can read it from both R and Python). However, it is still under development, and last time I checked they discouraged using it for long-term storage (since the specification may change in future versions), recommending it instead for communication between R and Python.
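
    The round trip in pandas is one call each way (a minimal sketch; requires the pyarrow or feather-format package, and the file name is illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(100, 3)), columns=list('abc'))
    df.to_feather('df.feather')
    df_back = pd.read_feather('df.feather')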

    The two of them launched Ursa Labs just weeks ago (at the time of writing), which will continue growing this and similar initiatives.

  • 2020-12-23 00:56

    You stated as the reasons for this question:

    ... it allows me to save a variety of information, ranging from reminders and to-do lists, to information about how I generated the data, or even what the estimation method for a particular variable was.

    May I suggest a different paradigm than that offered by Stata? Notes and characteristics seem very limited and confined to text only. Instead, you could use Jupyter Notebook for your research and data-analysis projects. It provides a rich environment to document your workflow and to capture details, thoughts, and ideas as you do your analysis and research. It is easy to share, and it's presentation-ready.

    Here is a gallery of interesting Jupyter Notebooks across many industries and disciplines to showcase the many features and use cases of notebooks. It may expand your horizons beyond trying to devise a way to tag simple snippets of text to your data.

  • 2020-12-23 00:56

    jpp's answer is pretty comprehensive; I just wanted to mention that as of pandas 0.22, parquet is a very convenient and fast option with almost no drawbacks versus csv (except perhaps the coffee break).

    See pandas.read_parquet and DataFrame.to_parquet.

    At the time of writing, you'll also need to

    pip install pyarrow
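
    With that installed, the pandas-level round trip is one call each way (a minimal sketch; the file name is illustrative, and note that parquet requires string column names):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(1000, 10)),
                      columns=[f'c{i}' for i in range(10)])
    df.to_parquet('df.parq')
    df_back = pd.read_parquet('df.parq')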
    

    In terms of adding information, parquet lets you attach metadata to the data at the schema level:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(np.random.normal(size=(1000, 10)))
    
    tab = pa.Table.from_pandas(df)
    
    tab = tab.replace_schema_metadata({'here' : 'it is'})
    
    pq.write_table(tab, 'where_is_it.parq')
    
    pq.read_table('where_is_it.parq')

    which then yields a table with the metadata attached:

    Pyarrow table
    0: double
    1: double
    2: double
    3: double
    4: double
    5: double
    6: double
    7: double
    8: double
    9: double
    __index_level_0__: int64
    metadata
    --------
    {b'here': b'it is'}
    

    To get this back to pandas:

    tab.to_pandas()
    
  • 2020-12-23 00:57

    There are many options. I will discuss only HDF5, because I have experience using this format.

    Advantages: portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.

    Disadvantages: reliance on a single low-level C API, possibility that corruption of the single file destroys all data, and deleting data does not automatically reduce file size.

    In my experience, for performance and portability, avoid pyTables / HDFStore to store numeric data. You can instead use the intuitive interface provided by h5py.

    Store an array

    import h5py, numpy as np
    
    arr = np.random.randint(0, 10, (1000, 1000))
    
    f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance
    
    dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100),
                            compression='gzip', compression_opts=9)
    

    Compression & chunking

    There are many compression choices, e.g. blosc and lzf are good choices for compression and decompression performance respectively. Note gzip is native; other compression filters may not ship by default with your HDF5 installation.

    Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
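
    For example, with the 100x100 chunks chosen above, a chunk-aligned slice touches exactly one chunk on disk:

    block = dset[:100, :100]  # decompresses and reads a single chunk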

    Add some attributes

    dset.attrs['Description'] = 'Some text snippet'
    dset.attrs['RowIndexArray'] = np.arange(1000)
    

    Store a dictionary

    d = {'my_key': np.arange(1000)}  # example: any mapping of names to arrays
    for k, v in d.items():
        f.create_dataset('dictgroup/' + str(k), data=v)
    

    Out-of-memory access

    dictionary = f['dictgroup']
    res = dictionary['my_key']  # a dataset handle; values are read from disk only when sliced
    

    There is no substitute for reading the h5py documentation, which exposes most of the C API, but you should see from the above there is a significant amount of flexibility.

  • 2020-12-23 00:58

    I agree with jpp that HDF5 storage is a good option here. The difference between his solution and mine is that mine uses pandas DataFrames instead of NumPy arrays. I prefer the DataFrame since it allows mixed types, multi-level indexing (even datetime indexing, which is VERY important for my work), and column labeling, which helps me remember how different datasets are organized. Pandas also provides a slew of built-in functionality (much like NumPy), including a built-in HDF writer, pandas.DataFrame.to_hdf, which I find convenient.

    When storing the DataFrame to h5, you have the option of storing a dictionary of metadata as well, which can be your notes to self, or actual metadata that does not need to be stored in the DataFrame (I use this for setting flags as well, e.g. {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}). In this regard, there is no difference between using a NumPy array and a DataFrame. For the full solution see my original question and solution here; a sketch of the pattern follows.
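
    A minimal sketch of that pattern via HDFStore (file and key names are illustrative; requires pytables):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.normal(size=(100, 3)), columns=list('abc'))

    with pd.HDFStore('data.h5') as store:
        store.put('df', df)
        # stash arbitrary metadata on the stored object's attributes
        store.get_storer('df').attrs.metadata = {'is_agl': True, 'scale_factor': 100}

    with pd.HDFStore('data.h5') as store:
        df_back = store['df']
        metadata = store.get_storer('df').attrs.metadata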
