I have selected this dataset: https://www.kaggle.com/karangadiya/fifa19
Now, I would like to convert this CSV file into a federated dataset that I can feed to a model.
TensorFlow provides tutorials on federated learning, but they use pre-defined datasets. My question is: how can I use this particular dataset in a federated learning scenario?
I'll use a different CSV dataset, but this should still address the core of this question, which is how to create a federated dataset from a CSV. Let's also assume that there is a column in that dataset which you would like to represent the client_ids for your data.
import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff
csv_url = "https://docs.google.com/spreadsheets/d/1eJo2yOTVLPjcIbwe8qSQlFNpyMhYj-xVnNVUTAhwfNU/gviz/tq?tqx=out:csv"
df = pd.read_csv(csv_url, na_values=("?",))
client_id_colname = 'native.country' # the column that represents client ID
# split client id into train and test clients
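# drop missing IDs (the "?" entries are read as NaN) so every client has rows
client_ids = df[client_id_colname].dropna().unique()
# unique() returns a NumPy array, which has no .sample(); wrap it in a Series
train_client_ids = pd.Series(client_ids).sample(frac=0.5).tolist()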
test_client_ids = [x for x in client_ids if x not in train_client_ids]
There are a few ways to do this, but the way I'll illustrate here uses tff.simulation.ClientData.from_clients_and_fn, which requires that we write a function that accepts a client_id as input and returns a tf.data.Dataset. We can easily construct this from the dataframe.
SHUFFLE_BUFFER = 1000  # assumed value; not defined in the snippet as posted
NUM_EPOCHS = 1  # assumed value; not defined in the snippet as posted

def create_tf_dataset_for_client_fn(client_id):
    # a function which takes a client_id and returns a
    # tf.data.Dataset for that client
    client_data = df[df[client_id_colname] == client_id]
    dataset = tf.data.Dataset.from_tensor_slices(client_data.to_dict('list'))
    dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
    return dataset
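To make the intermediate step concrete, here is a minimal sketch, using pandas only, of what `to_dict('list')` hands to `from_tensor_slices` for a single client. The frame `toy_df` and the helper `columns_for_client` are made up for illustration and are not part of the code above.

```python
import pandas as pd

# Made-up miniature of the CSV: one client-ID column plus two features.
toy_df = pd.DataFrame({
    "native.country": ["Cuba", "India", "Cuba", "India", "Cuba"],
    "age": [37, 50, 28, 41, 33],
    "workclass": ["Local-gov", "Private", "Private", "State-gov", "Private"],
})


def columns_for_client(df, client_id, client_id_colname="native.country"):
    # Keep only this client's rows, then turn each column into a plain list.
    client_rows = df[df[client_id_colname] == client_id]
    return client_rows.to_dict("list")


cuba_columns = columns_for_client(toy_df, "Cuba")
print(cuba_columns["age"])
print(cuba_columns["workclass"])
```

The result is a dictionary that maps each column name to a list with one entry per row, which is the shape `tf.data.Dataset.from_tensor_slices` slices into per-example dictionaries.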
Now, we can use the function above to create a ConcreteClientData object for our training and test data:
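train_data = tff.simulation.ClientData.from_clients_and_fn(
    train_client_ids, create_tf_dataset_for_client_fn)
test_data = tff.simulation.ClientData.from_clients_and_fn(
    test_client_ids, create_tf_dataset_for_client_fn)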
To see one instance of the dataset, try:
example_dataset = train_data.create_tf_dataset_for_client(
    train_data.client_ids[0])
print(type(example_dataset))
# <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
example_element = next(iter(example_dataset))
print(example_element)
# {'age': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([37], dtype=int32)>, 'workclass': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Local-gov'], dtype=object)>, ...
Each element of example_dataset is a Python dictionary where the keys are strings representing feature names, and the values are tensors with one batch of those features. Now, you have a federated dataset that can be preprocessed and used for modeling.
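As one possible next step, and only as a sketch, the dictionary elements could be mapped into an `x`/`y` structure before modeling. The feature names, the integer `income` label, and the helper `to_features_and_label` are assumptions for illustration; the toy dataset stands in for a real client dataset.

```python
import collections

import tensorflow as tf

# Assumed column names, for illustration only.
NUMERIC_FEATURES = ["age", "hours.per.week"]
LABEL = "income"


def to_features_and_label(element):
    # Stack the chosen numeric columns into one float tensor of shape
    # (batch, num_features) and keep the label as a separate entry.
    x = tf.stack(
        [tf.cast(element[name], tf.float32) for name in NUMERIC_FEATURES],
        axis=-1)
    y = tf.cast(element[LABEL], tf.int32)
    return collections.OrderedDict(x=x, y=y)


# Toy stand-in for one client's dataset, batched with batch size 1.
toy_dataset = tf.data.Dataset.from_tensor_slices({
    "age": [37, 50],
    "hours.per.week": [40, 13],
    "income": [0, 1],
}).batch(1)

processed = toy_dataset.map(to_features_and_label)
first = next(iter(processed))
print(first["x"].numpy(), first["y"].numpy())
```

String columns such as workclass would need their own encoding (for example a vocabulary lookup) before they could be stacked this way.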
You can convert your CSV file to federated data by first creating an h5 file from your CSV file.
Background: An h5 (HDF5) file is a hierarchical file format that stores data together with its metadata. This works well here because the hierarchical structure maps naturally onto federated user IDs.
When you create federated data, you do so through a client data object, and one implementation of client data, HDF5ClientData, is backed by an h5 file.
Federated Source Code : Client Data https://github.com/tensorflow/federated/blob/master/tensorflow_federated/python/simulation/hdf5_client_data.py
- Create your h5 file
- In your Federated experiment, create a client data object from that file, and then follow the Image Recognition tutorial on the Federated main page
Creating h5 file
import h5py

with h5py.File("student31.h5", 'a') as hdf:
    # top-level group that holds one sub-group per client
    example = hdf.create_group("examples")
    for i in range(0, 20):
        # for data in myDataFrame:
        #     localList.append(str(data))
        # print(type(myDataFrame))
        # data.append(myDataFrame)
        # the group name doubles as the client ID
        exampleGroup = example.create_group(str(i))
        # myClientGroup = hdf.create_group(str(i))
        # d1 = np.random.random(size = (100,33))
        print("printing the type ")
Federated Client data Instantiation
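# import path taken from the source file linked above
from tensorflow_federated.python.simulation.hdf5_client_data import HDF5ClientData
myclient = HDF5ClientData("student31.h5")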
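The loop above creates empty client groups, so here is a minimal sketch of filling them. As far as I can tell from the linked source, HDF5ClientData reads a top-level "examples" group with one sub-group per client ID and one dataset per feature; treat that layout as an assumption and check it against the version you use. The client names, feature names, and file name below are made up, and the sketch only uses h5py and NumPy.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical per-client data: client ID -> {feature name -> array}.
client_arrays = {
    "client_0": {
        "x": np.arange(6, dtype=np.float32).reshape(3, 2),
        "y": np.array([0, 1, 0], dtype=np.int32),
    },
    "client_1": {
        "x": np.ones((2, 2), dtype=np.float32),
        "y": np.array([1, 1], dtype=np.int32),
    },
}

h5_path = os.path.join(tempfile.mkdtemp(), "toy_clients.h5")

# Write one group per client under "examples", one dataset per feature.
with h5py.File(h5_path, "w") as hdf:
    examples = hdf.create_group("examples")
    for client_id, features in client_arrays.items():
        client_group = examples.create_group(client_id)
        for name, values in features.items():
            client_group.create_dataset(name, data=values)

# Read the file back to confirm the hierarchy.
with h5py.File(h5_path, "r") as hdf:
    stored_client_ids = sorted(hdf["examples"].keys())
    first_client_x = hdf["examples"]["client_0"]["x"][()]

print(stored_client_ids, first_client_x.shape)
```

With a real CSV, the per-client arrays would come from grouping the dataframe by the client ID column instead of being written by hand.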