Tensorflow-IO Dataset input pipeline with very large HDF5 files


Question


I have very large training files (30 GB).
Since all the data does not fit in my available RAM, I want to read the data in batches.
I saw that the TensorFlow-IO package provides a way to read HDF5 files into TensorFlow via the function tfio.IODataset.from_hdf5().
Then, since tf.keras.Model.fit() takes a tf.data.Dataset containing both samples and targets, I need to zip my X and Y together and then use .batch and .prefetch so that only the necessary data is loaded in memory. For testing, I applied this method to smaller samples: training (9 GB), validation (2.5 GB) and testing (1.2 GB), which I know work well because they fit in memory and give good results (70% accuracy and a loss below 1).
The data is stored in HDF5 files, split into sample (X) and label (Y) files, like so (a quick inspection sketch follows the list):

X_learn.hdf5  
X_val.hdf5  
X_test.hdf5  
Y_test.hdf5  
Y_learn.hdf5  
Y_val.hdf5
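
For reference, the dataset names and shapes inside each file can be checked with h5py before building the pipeline. This is only a minimal inspection sketch; the file name below is one of the files listed above, and the dataset names are assumed to match those later passed to tfio.IODataset.from_hdf5.

import h5py

# List every top-level dataset in one of the HDF5 files,
# together with its shape and dtype (e.g. /X_learn)
with h5py.File("X_learn.hdf5", "r") as f:
    for name, dset in f.items():
        print("/" + name, dset.shape, dset.dtype)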

Here is my code:

import tensorflow as tf
import tensorflow_io as tfio
from tensorflow.keras.optimizers import Adam

BATCH_SIZE = 2048
EPOCHS = 100

# Create an IODataset from an HDF5 file's dataset object
x_val = tfio.IODataset.from_hdf5(path_hdf5_x_val, dataset='/X_val')
y_val = tfio.IODataset.from_hdf5(path_hdf5_y_val, dataset='/Y_val')
x_test = tfio.IODataset.from_hdf5(path_hdf5_x_test, dataset='/X_test')
y_test = tfio.IODataset.from_hdf5(path_hdf5_y_test, dataset='/Y_test')
x_train = tfio.IODataset.from_hdf5(path_hdf5_x_train, dataset='/X_learn')
y_train = tfio.IODataset.from_hdf5(path_hdf5_y_train, dataset='/Y_learn')
 
# Zip together samples and corresponding labels
train = tf.data.Dataset.zip((x_train,y_train)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
test = tf.data.Dataset.zip((x_test,y_test)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
val = tf.data.Dataset.zip((x_val,y_val)).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)

# Build the model
model = build_model()
 
# Compile the model with a custom learning rate schedule for the Adam optimizer
model.compile(loss='categorical_crossentropy',
               optimizer=Adam(lr=lr_schedule(0)),
               metrics=['accuracy'])

# Fit the model with class weights calculated beforehand
model.fit(train,
          epochs=EPOCHS,
          class_weight=class_weights_train,
          validation_data=val,
          shuffle=True,
          callbacks=callbacks)

This code runs, but the loss becomes very high (300+) and the accuracy drops towards 0 (from 0.30 to 4e-5) right from the beginning... I don't understand what I am doing wrong. Am I missing something?


Answer 1:


Providing the solution here (Answer section) for the benefit of the community, even though it is already present in the Comments section.

There was no issue with the code; the problem was actually with the data (it was not preprocessed properly), so the model was not able to learn well, which led to the strange loss and accuracy values.
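
As an illustration of the kind of preprocessing that can be missing, one common fix is to standardize the features inside the tf.data pipeline before batching. The sketch below is only an example under that assumption; mean_x and std_x are hypothetical per-feature statistics (placeholder values here), and x_train, y_train and BATCH_SIZE refer to the objects from the question.

import tensorflow as tf

# Hypothetical per-feature statistics computed once over the training set
# (placeholder values, for illustration only)
mean_x = tf.constant(0.0, dtype=tf.float32)
std_x = tf.constant(1.0, dtype=tf.float32)

def standardize(x, y):
    # Scale features to zero mean / unit variance; labels pass through unchanged
    return (tf.cast(x, tf.float32) - mean_x) / std_x, y

train = (tf.data.Dataset.zip((x_train, y_train))
         .map(standardize, num_parallel_calls=tf.data.experimental.AUTOTUNE)
         .batch(BATCH_SIZE, drop_remainder=True)
         .prefetch(tf.data.experimental.AUTOTUNE))

The same map, with statistics computed on the training set only, would also be applied to the validation and test datasets.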



Source: https://stackoverflow.com/questions/60097970/tensorflow-io-dataset-input-pipeline-with-very-large-hdf5-files
