I'm running into trouble reading an HDF5 MATLAB 7.3 file with Python. I'm using h5py 2.0.1.
I can read all the matrices that are stored in the file, but I can not r
You can get the original MATLAB class name of Group and Dataset objects with dataset.attrs['MATLAB_class']. If dataset contains a string, it will return b'char'.
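A minimal sketch of that check, written so that it runs without a MATLAB file. The helper name matlab_class and the FakeDataset stand-in are hypothetical; the only thing assumed of a real h5py object is that it exposes a dict-like attrs. Depending on the h5py version, the attribute may come back as bytes or as str, so the helper normalizes both.

```python
def matlab_class(obj):
    """Return the MATLAB class name stored on an HDF5 object, or None.

    Hypothetical helper: `obj` is anything with a dict-like `attrs`,
    such as an h5py Group or Dataset.
    """
    value = obj.attrs.get("MATLAB_class")
    if value is None:
        return None
    # The attribute may arrive as bytes (b'char') or as str ('char').
    if isinstance(value, bytes):
        return value.decode("ascii")
    return str(value)


class FakeDataset:
    """Stand-in for an h5py Dataset, used only for this demonstration."""

    def __init__(self, attrs):
        self.attrs = attrs


char_ds = FakeDataset({"MATLAB_class": b"char"})
cell_ds = FakeDataset({"MATLAB_class": b"cell"})
plain_ds = FakeDataset({})

# True only for objects that MATLAB saved as character arrays.
is_string = matlab_class(char_ds) == "char"
```

With a real file, the same helper could be called on any Group or Dataset obtained from the open h5py file.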
I assume you mean it is a cell array of strings in MATLAB? This output looks normal: the dataset is an array of objects (|O4 is the NumPy object datatype). Each object is an array of 2-byte integers (<u2 is the NumPy little-endian unsigned 2-byte integer datatype). h5py has no way of knowing that the dataset is a cell array of strings; it could just as well be a cell array of arbitrary 16-bit integers.
The easiest way to get the strings out would be a list comprehension that uses unichr to convert the characters, like this:
strlist = [u''.join(unichr(c) for c in h5file[obj_ref]) for obj_ref in dataset]
What this does is iterate over the dataset (for obj_ref in dataset) to create a new list. For each object reference, it dereferences the object (h5file[obj_ref]) to get an array of integers. It converts each integer into a character (unichr(c)) and joins those characters all together into a Unicode string (u''.join()).
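The same steps can be sketched for Python 3, where unichr no longer exists and chr covers all code points. This is an untested-against-h5py sketch under stated assumptions: the names flatten and decode_cell_strings are hypothetical, and the flattening step is there because the reference dataset and the per-string integer arrays may be 2-D rather than 1-D. A plain dict and nested lists stand in for the open file and the dataset so the example runs on its own.

```python
def flatten(values):
    """Yield the scalar items of an arbitrarily nested iterable.

    Works on nested lists and should also work on NumPy arrays,
    whose rows are iterable while their scalar elements are not.
    """
    for item in values:
        if hasattr(item, "__iter__") and not isinstance(item, (str, bytes)):
            yield from flatten(item)
        else:
            yield item


def decode_cell_strings(h5file, dataset):
    """Dereference each entry of `dataset` and decode it to a str.

    `h5file[ref]` is assumed to return the array of 16-bit character
    codes that the reference `ref` points to.
    """
    return [
        "".join(chr(int(code)) for code in flatten(h5file[ref]))
        for ref in flatten(dataset)
    ]


# Stand-in data: string keys play the role of object references,
# and nested lists play the role of the (N, 1) integer arrays.
fake_file = {"ref_a": [[72], [105]], "ref_b": [[79], [75]]}
fake_dataset = [["ref_a", "ref_b"]]

strlist = decode_cell_strings(fake_file, fake_dataset)
```

With a real file, the open h5py file object and the cell-array dataset would be passed in place of the stand-ins. Converting code by code with chr is only correct for characters that fit in a single 16-bit unit.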
Note that this produces a list of Unicode strings. If you are absolutely sure that every string contains only ASCII characters, you can replace u'' by '' and unichr by chr.
Caveat: I don't have h5py; this post is based on my experience with MATLAB and NumPy. You may need to adjust the syntax or iteration order to suit your dataset.