Question
I am using `tf.data.Dataset` to prepare a streaming dataset which is used to train a `tf.keras` model. With kedro, is there a way to create a node that returns the created `tf.data.Dataset` so it can be used in the next training node?
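For context, this is roughly the two-node structure I have in mind (a minimal sketch; the function bodies and the `params:n_samples` entry are just placeholders I made up):

```python
import tensorflow as tf
from kedro.pipeline import Pipeline, node


def create_streaming_dataset(n: int) -> tf.data.Dataset:
    # Stand-in for the real dataset-preparation logic.
    return tf.data.Dataset.range(n).batch(2)


def train_model(dataset: tf.data.Dataset) -> str:
    # Stand-in for the real training logic consuming the dataset.
    for batch in dataset.take(1):
        print(batch)
    return "trained-model"


pipeline = Pipeline(
    [
        node(create_streaming_dataset, inputs="params:n_samples", outputs="streaming_dataset"),
        node(train_model, inputs="streaming_dataset", outputs="model"),
    ]
)
```

The question is what kedro should use to pass `streaming_dataset` between the two nodes.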
A `MemoryDataset` will probably not work, because a `tf.data.Dataset` cannot be pickled (deepcopying it isn't possible), see also this SO question. According to issue #91, the deep copy in `MemoryDataset` is done to avoid the data being modified by some other node. Can someone please elaborate a bit more on why/how this concurrent modification could happen?
From the docs, there seems to be a `copy_mode = "assign"` option. Would it be possible to use this option in case the data is not picklable?
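Something like the following is what I imagine (a sketch only; depending on the kedro version the class may be spelled `MemoryDataSet`, and the dataset name is a placeholder):

```python
from kedro.io import DataCatalog, MemoryDataset

# Register the intermediate dataset with copy_mode="assign" so the
# tf.data.Dataset object would be passed by reference instead of being
# deep-copied (which fails, since it cannot be pickled).
catalog = DataCatalog(
    {
        "streaming_dataset": MemoryDataset(copy_mode="assign"),
    }
)
```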
Another solution (also mentioned in issue #91) is to use just a function to generate the streaming `tf.data.Dataset` inside the training node, without having a preceding dataset-generation node; a rough sketch of what I mean follows below. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.
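A minimal sketch of this alternative, with the dataset creation folded into the training node (the generator and the toy model here are placeholders standing in for the real streaming source and model):

```python
import tensorflow as tf


def make_streaming_dataset(n_features: int, batch_size: int) -> tf.data.Dataset:
    # Placeholder generator standing in for a real streaming source
    # (files, a database cursor, etc.); yields (features, label) pairs.
    def gen():
        while True:
            yield tf.random.normal([n_features]), tf.constant(0.0)

    ds = tf.data.Dataset.from_generator(
        gen,
        output_signature=(
            tf.TensorSpec(shape=(n_features,), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.float32),
        ),
    )
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)


def train_model(n_features: int, batch_size: int, epochs: int) -> tf.keras.Model:
    # Training node: builds the tf.data.Dataset itself, so the
    # non-picklable dataset object never crosses a node boundary.
    dataset = make_streaming_dataset(n_features, batch_size)
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(n_features,))])
    model.compile(optimizer="adam", loss="mse")
    model.fit(dataset, epochs=epochs, steps_per_epoch=100)
    return model
```

This avoids passing the `tf.data.Dataset` between nodes entirely, at the cost of losing the separate dataset-generation node in the pipeline.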
Also, I would like to avoid storing the complete output of the streaming dataset, for example using TFRecords or `tf.data.experimental.save`, as these options would use a lot of disk storage.
Is there a way to pass just the created `tf.data.Dataset` object on to the training node?
Source: https://stackoverflow.com/questions/63730066/how-to-use-tf-data-dataset-with-kedro