(I have posted the question on https://github.com/tensorflow/federated/issues/793 and maybe also here!)
I have customized my own data and model to federated interfac
Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.
In fact, when writing TFF, there are generally three levels at which one may be writing:
tff.federated_sum
inside of a tff.federated_computation
decorator; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator".If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data in to a federated computation becomes relatively simple; it is just done at the Python level.
One way this could be done is via the create_tf_dataset_for_client method on the ClientData
object; as you loop over rounds, your Python code can select from the list of client_ids
, then you can instantiate a new list of tf.data.Datasets
and pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn
which takes client_id
as a parameter, and passing it to a separately-defined training loop would be here, in the code associated to this paper.
One final note: instantiating a tf.data.Dataset
does not actually load the dataset into memory; the dataset is only loaded in when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset
is to think of tf.data.Dataset
more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name would have been DataSource
for this construct; hopefully that may help the mental model on what is actually happening. Similarly, using the tff.simulation.ClientData
object generally shouldn't really load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.