How to save dictionaries and arrays in the same archive (with numpy.savez)

感情败类 2020-12-16 15:37

First question here. I'll try to be concise.

I am generating multiple arrays containing feature information for a machine learning application. As the arrays do not

3 answers
  • 2020-12-16 16:01

    If you need to save your data in a structured way, you should consider using the HDF5 file format (http://www.hdfgroup.org/HDF5/). It is flexible, easy to use, and efficient, and other software may already support it (HDFView, Mathematica, Matlab, Origin, ...). There is a simple Python binding called h5py.

    You can store datasets in a filesystem-like structure and define attributes for each dataset, much like a dictionary. For example:

    import numpy as np
    import h5py
    
    # some data
    table1 = np.array([(1,1), (2,2), (3,3)], dtype=[('x', float), ('y', float)])
    table2 = np.ones(shape=(3,3))
    
    # save the data to a file
    h5file = h5py.File("test.h5", "w")
    h5file.create_dataset("Table1", data=table1)
    # "gzip" is the explicit spelling of the legacy compression=True
    h5file.create_dataset("Table2", data=table2, compression="gzip")
    # add attributes
    h5file["Table2"].attrs["attribute1"] = "some info"
    h5file["Table2"].attrs["attribute2"] = 42
    h5file.close()
    

    Reading the data is also simple, you can even load just a few elements out of a large file if you want:

    h5file = h5py.File("test.h5", "r")
    # read from file (numpy-like behavior)
    print(h5file["Table1"]["x"][:2])
    # read everything into memory (real numpy array)
    print(np.array(h5file["Table2"]))
    # read attributes
    print(h5file["Table2"].attrs["attribute1"])
    h5file.close()
    

    More features and possibilities are found in the documentation and on the websites (the Quick Start Guide might be of interest).

  • 2020-12-16 16:01

    Put all your variables into an object and then use pickle. It's a better way to store state information.

  • 2020-12-16 16:03

    As @fraxel has already suggested, using pickle is a much better option in this case. Just save a dict with your items in it.

    However, be sure to use pickle with a binary protocol. By default (in Python 2), it uses a less efficient text format, which will result in excessive memory usage and huge files if your arrays are large.

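    import pickle

    # feature1, feature2, label1, label2 and docString are the question's variables
    saved_data = dict(saveFeature1=feature1,
                      saveFeature2=feature2,
                      saveLabel1=label1,
                      saveLabel2=label2,
                      saveString=docString)

    # open in binary mode and use the most compact protocol available
    with open('test.dat', 'wb') as outfile:
        pickle.dump(saved_data, outfile, protocol=pickle.HIGHEST_PROTOCOL)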
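
To make the pickle approach concrete, here is a minimal sketch of a full round trip, plus a size comparison between the text-based protocol 0 and the highest binary protocol. The values of feature1, label1 and docString are made-up stand-ins for the question's variables, and the temporary file path is only for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-ins for the question's variables.
feature1 = np.arange(10000, dtype=float)
label1 = {"classes": ["a", "b"], "weights": np.ones(3)}
docString = "feature set v1"

saved_data = dict(saveFeature1=feature1, saveLabel1=label1, saveString=docString)

# Round trip through a file opened in binary mode.
path = os.path.join(tempfile.mkdtemp(), "test.dat")
with open(path, "wb") as outfile:
    pickle.dump(saved_data, outfile, protocol=pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as infile:
    loaded = pickle.load(infile)

# Compare the text-based protocol 0 with the highest binary protocol.
size_text = len(pickle.dumps(saved_data, protocol=0))
size_binary = len(pickle.dumps(saved_data, protocol=pickle.HIGHEST_PROTOCOL))
print(size_text, size_binary)
```

The nested dict, the array and the string all come back with their original types, with no extra unwrapping step.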
    

    That having been said, let's take a look at what's happening in more detail for illustrative purposes.

    numpy.savez expects each item to be an array. In fact, it calls np.asarray on everything you pass in.

    If you turn a dict into an array, you'll get an object array. E.g.

    import numpy as np
    
    test = {'a':np.arange(10), 'b':np.arange(20)}
    testarr = np.asarray(test)
    

    Similarly, if you make an array out of a string, you'll get a string array (the output below is from Python 2; Python 3 reports dtype='<U3'):

    In [1]: np.asarray('abc')
    Out[1]: 
    array('abc', 
          dtype='|S3')
    

    However, because of a quirk in the way object arrays are handled, if you pass in a single object (in your case, your dict) that isn't a tuple, list, or array, you'll get a 0-dimensional object array.

    This means that you can't index it directly. In fact, doing testarr[0] will raise an IndexError. The data is still there, but you need to add a dimension first, so you have to do yourdictionary = testarr.reshape(-1)[0].
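
The behavior described above can be checked with a short sketch. It reuses the test dict from earlier; the names via_reshape and via_item are only illustrative, and item() is shown as an alternative to the reshape trick.

```python
import numpy as np

test = {'a': np.arange(10), 'b': np.arange(20)}
testarr = np.asarray(test)

print(testarr.shape)   # ()
print(testarr.dtype)   # object

# Indexing a 0-dimensional array with an integer fails.
try:
    testarr[0]
except IndexError as exc:
    print("IndexError:", exc)

# Two ways to get the original dict back.
via_reshape = testarr.reshape(-1)[0]
via_item = testarr.item()
print(via_reshape is test, via_item is test)
```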

    If all of this seems clunky, it's because it is. Object arrays are essentially always the wrong answer. (Although asarray should arguably pass in ndmin=1 to array, which would solve this particular problem, but potentially break other things.)

    savez is intended to store arrays, rather than arbitrary objects. Because of the way it works, it can store completely arbitrary objects, but it shouldn't be used that way.

    If you did want to use it, though, a quick workaround would be to do:

    np.savez(outputFile, 
             saveFeature1 = [feature1], 
             saveFeature2 = [feature2], 
             saveLabel1 = [label1], 
             saveLabel2 = [label2],
             saveString = docString)
    

    And you'd then access things with

    # NumPy >= 1.16.3 needs allow_pickle=True to load object arrays
    loadedArchive = np.load(outputFile, allow_pickle=True)
    loadedFeature1 = loadedArchive['saveFeature1'][0]
    loadedString = str(loadedArchive['saveString'])
    

    However, this is clearly much more clunky than just using pickle. Use numpy.savez when you're just saving arrays. In this case, you're saving nested data structures, not arrays.
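
For comparison, here is a rough, self-contained sketch of the savez workaround. It assumes feature1 is a dict and label1 a plain array (both invented for illustration), and it writes to an in-memory buffer rather than the question's outputFile.

```python
import io

import numpy as np

# Hypothetical stand-ins for the question's variables.
feature1 = {"mean": np.arange(3), "std": np.ones(3)}   # a dict, not an array
label1 = np.array([0, 1, 1])                           # a plain array
docString = "feature set v1"

# An in-memory buffer stands in for outputFile here.
buffer = io.BytesIO()
np.savez(buffer, saveFeature1=[feature1], saveLabel1=label1, saveString=docString)
buffer.seek(0)

# Object arrays are pickled inside the archive, so loading them needs
# allow_pickle=True (only do this with files you trust).
with np.load(buffer, allow_pickle=True) as archive:
    loadedFeature1 = archive["saveFeature1"][0]
    loadedLabel1 = archive["saveLabel1"]
    loadedString = str(archive["saveString"])

print(type(loadedFeature1), loadedLabel1, loadedString)
```

Only the dict needs the list wrapper and the [0] on the way out; the plain array and the string load directly, which is why savez is comfortable for arrays and awkward for everything else.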
