I am trying to optimize my data input pipeline. The dataset is a set of 450 TFRecord files of ~70 MB each, hosted on GCS. The job is executed on GCP ML Engine, with no GPU.
Posting the solution and the important observations of @AlexisBRENON in the answer section, for the benefit of the community.

The important observations are:
1. The `TFRecordDataset` interleaving is a legacy one, so the `interleave` function is better.
2. `batch` before `map` is a good habit (vectorizing your function) and reduces the number of times the mapped function is called.
3. There is no need for `repeat` anymore. Since TF2.0, the Keras model API supports the dataset API and can use the cache (see the SO post).
4. Switch from a `VarLenFeature` to a `FixedLenSequenceFeature`, removing a useless call to `tf.sparse.to_dense`.

Code for the pipeline, with improved performance in line with the above observations, is given below:
def build_dataset(file_pattern):
    return tf.data.Dataset.list_files(
        file_pattern
    ).interleave(
        tf.data.TFRecordDataset,
        cycle_length=tf.data.experimental.AUTOTUNE,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).shuffle(
        2048
    ).batch(
        batch_size=64,
        drop_remainder=True,
    ).map(
        map_func=parse_examples_batch,
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).cache(
    ).prefetch(
        tf.data.experimental.AUTOTUNE
    )
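As a rough illustration of the "batch before map" point, the sketch below counts how often a mapped function actually runs. The `tf.py_function` wrapper is used here only so the call count is observable from Python; it is not something you would keep in a real pipeline:

```python
import tensorflow as tf

counter = {"n": 0}

def count_calls(x):
    # Runs eagerly via tf.py_function, so each real invocation is counted.
    counter["n"] += 1
    return x

def mapped(x):
    return tf.py_function(count_calls, [x], x.dtype)

ds = tf.data.Dataset.range(8)

# map before batch: the function is invoked once per element (8 times).
for _ in ds.map(mapped).batch(4):
    pass
per_element = counter["n"]

counter["n"] = 0
# batch before map: the function is invoked once per batch (2 times).
for _ in ds.batch(4).map(mapped):
    pass
per_batch = counter["n"]
```

With a vectorized parsing function such as `tf.io.parse_example`, the per-batch call does the same work in far fewer invocations.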
def parse_examples_batch(examples):
    preprocessed_sample_columns = {
        "features": tf.io.FixedLenSequenceFeature((), tf.float32, allow_missing=True),
        "booleanFeatures": tf.io.FixedLenFeature((), tf.string, ""),
        "label": tf.io.FixedLenFeature((), tf.float32, -1)
    }
    samples = tf.io.parse_example(examples, preprocessed_sample_columns)
    bits_to_float = tf.io.decode_raw(samples["booleanFeatures"], tf.uint8)
    return (
        (samples['features'], bits_to_float),
        tf.expand_dims(samples["label"], 1)
    )
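For the last observation, the difference between the two parsing approaches can be sketched on a single hand-built record (the feature name and values here are just for illustration):

```python
import tensorflow as tf

# A hypothetical serialized Example with one variable-length float feature.
example = tf.train.Example(features=tf.train.Features(feature={
    "features": tf.train.Feature(
        float_list=tf.train.FloatList(value=[1.0, 2.0, 3.0]))
})).SerializeToString()

# Old approach: VarLenFeature yields a SparseTensor that must be densified.
sparse = tf.io.parse_single_example(
    example, {"features": tf.io.VarLenFeature(tf.float32)})
dense_via_sparse = tf.sparse.to_dense(sparse["features"])

# New approach: FixedLenSequenceFeature(allow_missing=True) is dense already,
# so the extra tf.sparse.to_dense call disappears from the pipeline.
dense = tf.io.parse_single_example(
    example,
    {"features": tf.io.FixedLenSequenceFeature((), tf.float32,
                                               allow_missing=True)})
```

Both approaches decode the same values; the second simply skips the sparse-to-dense conversion.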