Filter HDF dataset from H5 file using attribute

Submitted by 旧城冷巷雨未停 on 2019-12-10 21:17:53

Question


I have an h5 file containing multiple groups and datasets. Each dataset has associated attributes. I want to find/filter the datasets in this h5 file based upon the respective attribute associated with it.

Example:

dataset1 =cloudy(attribute) 
dataset2 =rainy(attribute)
dataset3 =cloudy(attribute)

I want to find the datasets whose weather attribute/metadata is cloudy.

What is the simplest, most Pythonic way to do this?


Answer 1:


There are two main ways to access HDF5 data with Python: h5py and PyTables. Both are good, with different capabilities:

  • h5py (from h5py FAQ): attempts to map the HDF5 feature set to NumPy as closely as possible. Some say that makes h5py more "pythonic".
  • PyTables (from PyTables FAQ): builds an additional abstraction layer on top of HDF5 and NumPy. It has more extensive search capabilities (compared to h5py).

When working with HDF5 data, it is important to understand the HDF5 data model. That goes beyond the scope of this post. For simplicity's sake, think of the data model as a file system, where "groups" and "datasets" are like "folders" and "files". Both can have attributes. "Node" is the term used to refer to either a "group" or a "dataset".

@Kiran Ramachandra outlined a method with h5py. Since you tagged your post with pytables, outlined below is the same process with pytables.

Note: Kiran's example assumes datasets 1,2,3 are all at the root level. You said you also have groups. Likely your groups also have some datasets. You can use the HDFView utility to view the data model and your data.

import tables as tb
h5f = tb.open_file('a.h5')

This gives you a file object you use to access additional objects (groups or datasets).

h5f.walk_nodes() 

This returns an iterator over all nodes and subnodes, giving the complete HDF5 data structure (remember, a "node" can be either a group or a dataset). You can list all nodes and their types with:

for anode in h5f.walk_nodes():
    print(anode)

Use the following to get (a non-recursive) Python List of node names:

h5f.list_nodes() 
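A quick sketch of the difference between the two (using a small file like the one created at the end of this answer): list_nodes() sees only a group's direct children, while walk_nodes() recurses into subgroups.

```python
import numpy as np
import tables as tb

# Minimal file: one dataset at the root, one inside a group.
ds_dtype = np.dtype([('a', int), ('b', float)])
with tb.open_file('a.h5', 'w') as h5f:
    h5f.create_table(h5f.root, 'dataset1', description=ds_dtype)
    h5f.create_group(h5f.root, 'agroup')
    h5f.create_table(h5f.root.agroup, 'dataset2', description=ds_dtype)

with tb.open_file('a.h5') as h5f:
    # Direct children of the root only:
    top = sorted(n._v_name for n in h5f.list_nodes('/'))
    print(top)        # ['agroup', 'dataset1']
    # Recursive walk over leaves reaches the nested dataset too:
    all_paths = sorted(n._v_pathname for n in h5f.walk_nodes('/', classname='Leaf'))
    print(all_paths)  # ['/agroup/dataset2', '/dataset1']
```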

This will fetch the value of attribute cloudy from dataset1 (if it exists):

h5f.root.dataset1._f_getattr('cloudy')

If you want all attributes for a node, use this (shown for dataset1):

ds1_attrs = h5f.root.dataset1._v_attrs._v_attrnames
for attr_name in ds1_attrs:
    print('Attribute', attr_name, '=', h5f.root.dataset1._f_getattr(attr_name))

All of the above references dataset1 at the root level (h5f.root). If a dataset is inside a group, you simply add the group name to the path. For dataset2 in a group named agroup, use:

h5f.root.agroup.dataset2._f_getattr('rainy')

This will fetch the value of attribute rainy from dataset2 in agroup (if it exists).

If you want all attributes for dataset2:

ds2_attrs = h5f.root.agroup.dataset2._v_attrs._v_attrnames
for attr_name in ds2_attrs:
    print('Attribute', attr_name, '=', h5f.root.agroup.dataset2._f_getattr(attr_name))
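Putting this together for the question itself: a minimal sketch that walks every dataset and keeps the ones with a matching attribute. It assumes the weather is stored as an attribute named weather with values like 'cloudy' (following the question's wording; the creation code below instead uses attribute names like 'cloudy'), so adjust the names to match your file.

```python
import numpy as np
import tables as tb

# Build a file matching the example layout: dataset1 at the root,
# dataset2 and dataset3 inside a group, each tagged with a
# 'weather' attribute (an assumed attribute name).
ds_dtype = np.dtype([('a', int), ('b', float)])
with tb.open_file('a.h5', 'w') as h5f:
    ds1 = h5f.create_table(h5f.root, 'dataset1', description=ds_dtype)
    ds1._f_setattr('weather', 'cloudy')
    h5f.create_group(h5f.root, 'agroup')
    ds2 = h5f.create_table(h5f.root.agroup, 'dataset2', description=ds_dtype)
    ds2._f_setattr('weather', 'rainy')
    ds3 = h5f.create_table(h5f.root.agroup, 'dataset3', description=ds_dtype)
    ds3._f_setattr('weather', 'cloudy')

# Walk every leaf (dataset) in the file and collect the paths of
# those whose 'weather' attribute is 'cloudy'.
cloudy = []
with tb.open_file('a.h5') as h5f:
    for node in h5f.walk_nodes('/', classname='Leaf'):
        if 'weather' in node._v_attrs._v_attrnames:
            if node._f_getattr('weather') == 'cloudy':
                cloudy.append(node._v_pathname)

print(sorted(cloudy))  # ['/agroup/dataset3', '/dataset1']
```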

For completeness, enclosed below is the code to create a.h5 used in my example. numpy is only required to define the dtype when creating the table. In general, HDF5 files are interchangeable (so you can open this example with h5py).

import tables as tb
import numpy as np
h5f = tb.open_file('a.h5','w')

#create dataset 1 at root level, and assign attribute
ds_dtype = np.dtype([('a',int),('b',float)])
dataset1 = h5f.create_table(h5f.root, 'dataset1', description=ds_dtype)
dataset1._f_setattr('cloudy', 'True')

#create a group at root level
h5f.create_group(h5f.root, 'agroup')

#create dataset 2,3 at root.agroup level, and assign attributes
dataset2 = h5f.create_table(h5f.root.agroup, 'dataset2', description=ds_dtype)
dataset2._f_setattr('rainy', 'True')
dataset3 = h5f.create_table(h5f.root.agroup, 'dataset3', description=ds_dtype)
dataset3._f_setattr('cloudy', 'True')

h5f.close()



Answer 2:


You can get at the datasets directly from the h5 file in the following fashion. Let's say you have a file a.h5; you can filter its contents in the following Pythonic way.

import h5py
import numpy
data = h5py.File('a.h5', 'r')

Now data is a file object that behaves like a dictionary. To see the top-level names it contains:

data.keys()

This lists the objects at the root of the h5 file; in your case dataset1, dataset2, dataset3.

The attributes of each dataset are exposed through its .attrs property, which again behaves like a dictionary. So,

data['dataset1'].attrs.keys()

will fetch cloudy, and so on if it exists;

data['dataset2'].attrs.keys()

will fetch rainy, and so on if it exists;

data['dataset3'].attrs.keys()

will fetch cloudy, and so on if it exists.

To read an attribute's value, index .attrs like a dict:

data['dataset1'].attrs['cloudy']
data['dataset2'].attrs['rainy']
data['dataset3'].attrs['cloudy']

Once you know the attribute names, you can test for the one you need with the in operator (dict.has_key() no longer exists in Python 3):

if 'cloudy' in data['dataset3'].attrs:

Then append the matching dataset's data onto the required variable. The easiest form to work with is a NumPy array, e.g. data['dataset3'][:].
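For completeness, here is a minimal runnable sketch of this h5py approach. It assumes the weather is stored in an attribute named weather (an assumed name; the question only says the attribute value is e.g. cloudy), and uses visititems() in place of the manual key loops, which also reaches datasets nested inside groups:

```python
import h5py

# Recreate the example layout at the root of a file, storing the
# weather as an attribute named 'weather' (assumed attribute name).
with h5py.File('a.h5', 'w') as f:
    for name, weather in [('dataset1', 'cloudy'),
                          ('dataset2', 'rainy'),
                          ('dataset3', 'cloudy')]:
        ds = f.create_dataset(name, data=[1, 2, 3])
        ds.attrs['weather'] = weather

# visititems() calls the function once per group/dataset in the file;
# collect the names of the datasets whose attribute matches.
matches = []
with h5py.File('a.h5', 'r') as f:
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.attrs.get('weather') == 'cloudy':
            matches.append(name)
    f.visititems(visit)

print(sorted(matches))  # ['dataset1', 'dataset3']
```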




Answer 3:


This is a modification of Sumit's code (posted in his answer). Note: I removed the f.close() statements after the create_group and create_dataset calls; the with block closes the file on exit. After the attributes are added, the last section of code retrieves them (and prints attribute name/value under group/dataset names).

import h5py

dat=[1,2,3,45]

with h5py.File('temp.h5', 'w') as f:
    group1 = f.create_group('my_group1')
    dset11 = group1.create_dataset('my_dataset11', data=dat, compression=9)
    dset12 = group1.create_dataset('my_dataset12', data=dat, compression=9)
    dset13 = group1.create_dataset('my_dataset13', data=dat, compression=9)
    group2 = f.create_group('my_group2')
    dset21 = group2.create_dataset('my_dataset21', data=dat, compression=9)
    dset22 = group2.create_dataset('my_dataset22', data=dat, compression=9)
    dset23 = group2.create_dataset('my_dataset23', data=dat, compression=9)

    groups=list(f.keys())

    grp=f[groups[0]]
    dataset=list(grp.keys())

    for each in dataset:
        grp[each].attrs['env']='cloudy'
        grp[each].attrs['temp']=25

    grp=f[groups[1]]
    dataset=list(grp.keys())

    for each in dataset:
        grp[each].attrs['env']='rainy'
        grp[each].attrs['temp']=20

    for each_grp in groups:
        dataset=list(f[each_grp].keys())
        for each_ds in dataset:
            print ('For ', each_grp, '.', each_ds,':')
            print ('\tenv =', f[each_grp][each_ds].attrs['env'])
            print ('\ttemp=',f[each_grp][each_ds].attrs['temp'])


Output should look like this:

For  my_group1 . my_dataset11 :
    env = cloudy
    temp= 25
For  my_group1 . my_dataset12 :
    env = cloudy
    temp= 25
For  my_group1 . my_dataset13 :
    env = cloudy
    temp= 25
For  my_group2 . my_dataset21 :
    env = rainy
    temp= 20
For  my_group2 . my_dataset22 :
    env = rainy
    temp= 20
For  my_group2 . my_dataset23 :
    env = rainy
    temp= 20
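With those attributes in place, filtering comes down to the same nested key loop. A sketch that rebuilds temp.h5 (so it runs standalone) and collects the datasets whose env attribute is 'cloudy':

```python
import h5py

# Rebuild temp.h5 with the same layout and attributes as above.
dat = [1, 2, 3, 45]
with h5py.File('temp.h5', 'w') as f:
    for g, (grp_name, env, temp) in enumerate([('my_group1', 'cloudy', 25),
                                               ('my_group2', 'rainy', 20)], start=1):
        grp = f.create_group(grp_name)
        for i in (1, 2, 3):
            ds = grp.create_dataset('my_dataset%d%d' % (g, i), data=dat, compression=9)
            ds.attrs['env'] = env
            ds.attrs['temp'] = temp

# Filter: walk group keys, then dataset keys, testing the attribute.
cloudy = []
with h5py.File('temp.h5', 'r') as f:
    for grp_name in f.keys():
        for ds_name in f[grp_name].keys():
            if f[grp_name][ds_name].attrs.get('env') == 'cloudy':
                cloudy.append('%s/%s' % (grp_name, ds_name))

print(sorted(cloudy))
```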


Source: https://stackoverflow.com/questions/54020109/filter-hdf-dataset-from-h5-file-using-attribute
