Pandas to_hdf succeeds but then read_hdf fails

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-11 11:07:04


Pandas to_hdf succeeds but then read_hdf fails when I use custom objects as column headers (I use custom objects because I need to store other info in them).

Is there some way to make this work? Or is this just a Pandas bug or PyTables bug?

As an example, below I first make a DataFrame foo with string column headers, and everything works fine with to_hdf/read_hdf. When I then change foo to use a custom Col class for the column headers, to_hdf still works fine, but read_hdf raises an AssertionError:

In [48]: foo = pd.DataFrame(np.random.randn(2, 3), columns = ['aaa', 'bbb', 'ccc'])

In [49]: foo
        aaa       bbb       ccc
0 -0.434303  0.174689  1.373971
1 -0.562228  0.862092 -1.361979

In [50]: foo.to_hdf('foo.h5', 'foo')

In [51]: bar = pd.read_hdf('foo.h5', 'foo')

In [52]: bar
        aaa       bbb       ccc
0 -0.434303  0.174689  1.373971
1 -0.562228  0.862092 -1.361979

In [53]: class Col(object):
...:     def __init__(self, name, other_info):
...:         self.name = name
...:         self.other_info = other_info
...:     def __str__(self):
...:         return self.name

In [54]: foo = pd.DataFrame(np.random.randn(2, 3), columns = [Col('aaa', {'z': 5}), Col('bbb', {'y': True}), Col('ccc', {})])

In [55]: foo
        aaa       bbb       ccc
0 -0.830503  1.066178  1.057349
1  0.406967 -0.131430  1.970204

In [56]: foo.to_hdf('foo.h5', 'foo')

In [57]: bar = pd.read_hdf('foo.h5', 'foo')
AssertionError                            Traceback (most recent call last)
<ipython-input-57-888b061a1d2c> in <module>()
----> 1 bar = pd.read_hdf('foo.h5', 'foo')

/.../python3.4/site-packages/pandas/io/ in read_hdf(path_or_buf, key, **kwargs)
331     try:
--> 332         return, auto_close=auto_close, **kwargs)
333     except:
334         # if there is an error, close the store

/.../python3.4/site-packages/pandas/io/ in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
672                            auto_close=auto_close)
--> 674         return it.get_result()
676     def select_as_coordinates(

/.../python3.4/site-packages/pandas/io/ in get_result(self, coordinates)
   1367         # directly return the result
-> 1368         results = self.func(self.start, self.stop, where)
   1369         self.close()
   1370         return results

/.../python3.4/site-packages/pandas/io/ in func(_start, _stop, _where)
665             return, stop=_stop,
666                           where=_where,
--> 667                           columns=columns, **kwargs)
669         # create the iterator

/.../python3.4/site-packages/pandas/io/ in read(self, **kwargs)
   2792             blocks.append(blk)
-> 2794         return self.obj_type(BlockManager(blocks, axes))
   2796     def write(self, obj, **kwargs):

/.../python3.4/site-packages/pandas/core/ in __init__(self, blocks, axes, do_integrity_check, fastpath)
   2180         self._consolidate_check()
-> 2182         self._rebuild_blknos_and_blklocs()
   2184     def make_empty(self, axes=None):

/.../python3.4/site-packages/pandas/core/ in _rebuild_blknos_and_blklocs(self)
   2272         if (new_blknos == -1).any():
-> 2273             raise AssertionError("Gaps in blk ref_locs")
   2275         self._blknos = new_blknos

AssertionError: Gaps in blk ref_locs


So Jeff answered (a) "this is not supported" and (b) "if you have meta-data then write it to the attributes".

Question 1 regarding (a): My column header objects have methods to return their properties, etc. For example, instead of a column header string 'x5y3z8' where I would have to parse out the values, I can simply do col_header.x (gives 5), col_header.y (gives 3), etc. This is very object-oriented and pythonic, compared with storing the info in a string and parsing it every time I need it. How do you suggest I replace my current column header objects in a nice way (that's also supported)?

(BTW, you might look at 'x5y3z8' and think hierarchical index works, but that is not the case because not every column header is 'x#y#z#'. I might have one column 'foo' of strings, another one 'bar5baz7' of ints, and another 'x5y3z8' of floats. The column headers aren't uniform.)

Question 2 regarding (a): When you say it's not supported, are you specifically talking about to_hdf/read_hdf not supporting it, or are you actually saying that Pandas in general doesn't support it? If it's only the HDF5 support that's missing, then I could switch to some other way of saving the DataFrames to disk and have it work, right? Do you foresee any problems with that in the future? Will this ever break with to_pickle/read_pickle, for example? (I lose performance, but got to give up something, right?)

Question 3 regarding (b): What do you mean by "if you have meta-data then write it to the attributes"? Attributes of what? A simple example would help me a lot. I'm pretty new to Pandas. Thanks!


This is not a supported feature.

This will raise in the next version of pandas (on writing) for format='table'. It should raise for the fixed format as well, but that's not implemented. This is simply not supported, nor is it likely to be. You should just use strings. If you have meta-data then write it to the attributes.
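
As a rough illustration of "use strings" plus "write it to the attributes", here is a minimal sketch under one reading of that advice: the column headers become plain strings, and the extra per-column info is stored as a dict on the attributes of the stored node via HDFStore.get_storer(key).attrs. The helper names save_with_info and load_with_info and the attribute name col_info are hypothetical, chosen only for this example; the Col class is the one from the question.

```python
import numpy as np
import pandas as pd


class Col(object):
    """Column descriptor from the question: a name plus extra info."""
    def __init__(self, name, other_info):
        self.name = name
        self.other_info = other_info

    def __str__(self):
        return self.name


cols = [Col('aaa', {'z': 5}), Col('bbb', {'y': True}), Col('ccc', {})]

# Use plain strings as the column headers ...
foo = pd.DataFrame(np.random.randn(2, 3), columns=[str(c) for c in cols])
# ... and keep the extra info separately, keyed by column name.
col_info = {c.name: c.other_info for c in cols}


def save_with_info(df, info, path, key):
    """Hypothetical helper: write df, then attach info to the node's attributes."""
    with pd.HDFStore(path, mode='w') as store:
        store.put(key, df)
        # 'col_info' is an arbitrary attribute name picked for this sketch.
        store.get_storer(key).attrs.col_info = info


def load_with_info(path, key):
    """Hypothetical helper: read back the frame and the attached info."""
    with pd.HDFStore(path, mode='r') as store:
        df = store[key]
        info = store.get_storer(key).attrs.col_info
    return df, info


# Usage (writes foo.h5 in the current directory):
#   save_with_info(foo, col_info, 'foo.h5', 'foo')
#   bar, info = load_with_info('foo.h5', 'foo')
#   rebuilt = [Col(name, info[name]) for name in bar.columns]
```

After loading, the Col objects can be rebuilt from the string headers and the stored dict, so attribute-style access to the extra info is still available in memory while only strings and a plain dict go to disk.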

