I have a class like this:
class C:
    def __init__(self, id, user_id, photo):
        self.id = id
        self.user_id = user_id
        self.photo =
Although you can store the whole data structure in a single HDF5 table, it is probably much easier to store the described class as three separate variables - two 1D arrays of integers and a data structure for storing your 'photo' attribute.
If you care about file size and speed and do not care about human-readability of your files, you can model your 64 bool values either as eight 1D arrays of UINT8 or as a single 2D N x 8 array of UINT8 (or CHARs). You can then implement a simple interface that packs your bool values into the bits of each UINT8 and unpacks them again (see, e.g., "How to convert a boolean array to an int array").
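A minimal sketch of such a packing interface, assuming NumPy is available. The helper names `pack_flags` and `unpack_flags` are hypothetical, chosen here for illustration only:

```python
import numpy as np

def pack_flags(flags):
    """Pack a sequence of 64 booleans into 8 bytes (uint8). Hypothetical helper."""
    bits = np.asarray(flags, dtype=bool)
    if bits.size != 64:
        raise ValueError("expected exactly 64 flags")
    # packbits stores 8 flags per byte, most significant bit first.
    return np.packbits(bits)

def unpack_flags(packed):
    """Inverse of pack_flags: 8 uint8 values back to 64 booleans."""
    return np.unpackbits(np.asarray(packed, dtype=np.uint8)).astype(bool)

flags = [True, False, True, True, False, True, True, True] * 8
packed = pack_flags(flags)       # 8 values of dtype uint8
restored = unpack_flags(packed)  # 64 booleans again
```

Each record then occupies one row of the N x 8 UINT8 array, i.e. 8 bytes instead of 64.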
As far as I know, there are no built-in search functions in HDF5, but you can read in the variable containing user_ids and then simply use Python to find the indexes of all elements matching your user_id.
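The lookup itself could be sketched as follows. To keep the example self-contained, a small in-memory NumPy array stands in for the data read from the file (with h5py this would typically be something like `f["user_id"][:]`); `find_indexes` is a hypothetical helper name:

```python
import numpy as np

# Stand-in for the contents of the user_id variable read from the file.
user_ids = np.array([3, 53, 3, 7, 53, 53], dtype=np.int32)

def find_indexes(user_ids, wanted):
    """Return the positions of all elements equal to `wanted`."""
    return np.flatnonzero(user_ids == wanted)

hits = find_indexes(user_ids, 53)
```

This reads the whole user_id variable into memory, which is usually acceptable for a 1D integer array even with millions of records.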
Once you have the indexes, you can read in the relevant slices of your other variables. HDF5 natively supports efficient slicing, but it works on ranges, so you might want to think about how to store records with the same user_id in contiguous chunks; see the discussion in "h5py: Correct way to slice array datasets".
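One way to exploit range-based slicing is to collapse the matching indexes into contiguous ranges first. The sketch below assumes the indexes are sorted in ascending order; `contiguous_ranges` is a hypothetical helper, not part of h5py:

```python
import numpy as np

def contiguous_ranges(indexes):
    """Group sorted integer indexes into half-open (start, stop) ranges."""
    idx = np.asarray(indexes, dtype=np.int64)
    if idx.size == 0:
        return []
    # A new run starts wherever the step between neighbours is not 1.
    breaks = np.flatnonzero(np.diff(idx) != 1) + 1
    return [(int(run[0]), int(run[-1]) + 1) for run in np.split(idx, breaks)]

ranges = contiguous_ranges([2, 3, 4, 9, 10, 15])
```

Each `(start, stop)` pair can then be read with a single slice such as `dset[start:stop]`. If records are sorted by user_id before writing, every user maps to exactly one range.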
You might also want to look into PyTables, a Python interface built on top of HDF5 that stores data in table-like structures.
import numpy as np
import h5py


class C:
    def __init__(self, id, user_id, photo):
        self.id = id
        self.user_id = user_id
        self.photo = photo  # sequence of 64 booleans

def write_records(records, file_out):
    n = len(records)
    with h5py.File(file_out, "w") as f:
        # One dataset per attribute, sized to the number of records.
        dset_id = f.create_dataset("id", (n,), dtype='i')
        dset_user_id = f.create_dataset("user_id", (n,), dtype='i')
        # 64 booleans pack into 8 bytes; 'uint8' is one byte per value
        # ('u8' would be an 8-byte unsigned integer).
        dset_photo = f.create_dataset("photo", (n, 8), dtype='uint8')
        dset_id[0:n] = [r.id for r in records]
        dset_user_id[0:n] = [r.user_id for r in records]
        dset_photo[0:n] = [np.packbits(np.array(r.photo, dtype=bool)) for r in records]

def read_records_by_id(file_in, record_id):
    res = []
    with h5py.File(file_in, "r") as f:
        # Read the whole "id" dataset into memory and search it with NumPy.
        data = f["id"][:]
        for idx in np.where(data == record_id)[0].tolist():
            # Unpack the 8 stored bytes back into 64 booleans.
            photo = np.unpackbits(np.array(f["photo"][idx], dtype='uint8')).astype(bool)
            res.append(C(f["id"][idx], f["user_id"][idx], photo))
    return res

m = [True, False, True, True, False, True, True, True]
m = m * 8  # 64 boolean flags
records = [C(1, 3, m), C(34, 53, m)]

# Write records to file
write_records(records, "mytestfile.h5")

# Read record from file
res = read_records_by_id("mytestfile.h5", 34)
print(res[0].id)
print(res[0].user_id)
print(res[0].photo)