Question
I am training a deep learning model with TensorFlow 2 and Keras. I read my big CSV file with tf.data.experimental.make_csv_dataset and then split it into train and test datasets. However, I need to split my train dataset into three parts, since my model takes two sets of inputs in different layers, so I need to pass [x1_train, x2_train], y_train to model.fit.
My question is: how can I split train_dataset into x1_train, x2_train and y_train? (Some features should go into x1_train and some into x2_train.)
My code:
def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=64,
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
full_dataset = get_dataset(dataset_path)
full_dataset = full_dataset.shuffle(buffer_size=400000)
train_dataset = full_dataset.take(360000)
test_dataset = full_dataset.skip(360000)
test_dataset = test_dataset.take(40000)
x1_train =train_dataset[:,0:2820]
x2_train =train_dataset[:,2820:2822]
y_train=train_dataset[:,2822]
x1_test =x_test[:,0:2820]
x2_test =x_test[:,2820:2822]
y_test=test_dataset[:,2822]
model.fit([x1_train,x2_train],y_train,validation_data=[x1_test,x2_test],y_test, callbacks=callbacks_list, verbose=1,epochs=EPC)
Error message:
x1_train =train_dataset[:,0:2820]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'TakeDataset' object is not subscriptable
Answer 1:
As mentioned in the comments section, you can use the map method of the Dataset object returned by make_csv_dataset to split and combine the samples according to your model's expected input format.
For example, suppose we have a CSV file containing the following data:
a,b,c,d,e
1,2,3,4,111
5,6,7,8,222
9,10,11,12,333
13,14,15,16,444
Now, suppose we want to read this CSV file with the make_csv_dataset function; however, our model has two input layers named input1 and input2 (set using the name argument of the Input layer), where input1 is fed the feature values in columns a and b, and input2 uses the feature values in columns c and d. Further, column e is our target (i.e. label) column.
So let's first read this data and see what it looks like:
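from pprint import pprint

import tensorflow as tf

# Read the CSV in batches of 2 rows; column 'e' is returned separately as the label.
dataset = tf.data.experimental.make_csv_dataset(
    'data.csv',
    batch_size=2,
    label_name='e',
    num_epochs=1,
)

# Each element is a (features, labels) tuple for one batch.
for x in dataset:
    pprint(x)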
"""
The printed result:
(OrderedDict([('a',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([5, 1], dtype=int32)>),
('b',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([6, 2], dtype=int32)>),
('c',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([7, 3], dtype=int32)>),
('d',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([8, 4], dtype=int32)>)]),
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([222, 111], dtype=int32)>)
(OrderedDict([('a',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([13, 9], dtype=int32)>),
('b',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([14, 10], dtype=int32)>),
('c',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([15, 11], dtype=int32)>),
('d',
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([16, 12], dtype=int32)>)]),
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([444, 333], dtype=int32)>)
"""
As you can see, the first element of each batch is a dictionary mapping column names to the respective feature values. Now, let's use the map method to split and combine these feature values into the proper format for our model:
first_input_cols = ['a', 'b']
second_input_cols = ['c', 'd']

def split_and_combine_batch_samples(samples, targets):
    # Collect the per-column tensors (each of shape (batch,)) for each input.
    inp1 = []
    for k in first_input_cols:
        inp1.append(samples[k])
    inp2 = []
    for k in second_input_cols:
        inp2.append(samples[k])

    # Stack along the last axis to get tensors of shape (batch, n_columns).
    inp1 = tf.stack(inp1, axis=-1)
    inp2 = tf.stack(inp2, axis=-1)
    # The keys must match the names of the model's Input layers.
    return {'input1': inp1, 'input2': inp2}, targets

dataset = dataset.map(split_and_combine_batch_samples)

for x in dataset:
    pprint(x)
"""
The printed values:
({'input1': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[ 9, 10],
[13, 14]], dtype=int32)>,
'input2': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[11, 12],
[15, 16]], dtype=int32)>},
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([333, 444], dtype=int32)>)
({'input1': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[5, 6],
[1, 2]], dtype=int32)>,
'input2': <tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[7, 8],
[3, 4]], dtype=int32)>},
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([222, 111], dtype=int32)>)
"""
That's it! Now you can further modify this new dataset (e.g. use take, shuffle, etc.) and, when ready, pass it to the fit method of your model (don't forget to give names to the input layers of your model, though).
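To make the last step concrete, here is a hedged end-to-end sketch on the toy CSV from above. Several details are assumptions rather than part of the original answer: the file is written to a temporary directory, features and labels are cast to float32, the model is a single Dense layer on the concatenated inputs, and the split is one batch for training and the rest for validation. Note that the elements of the dataset are batches, so take and skip count batches, not rows. Also, shuffle reshuffles on every pass by default, so the sketch turns off shuffling inside make_csv_dataset and shuffles once with reshuffle_each_iteration=False to keep the two parts disjoint:

```python
import os
import tempfile

import tensorflow as tf

# Write the toy CSV to a temporary file (illustration only).
csv_text = "a,b,c,d,e\n1,2,3,4,111\n5,6,7,8,222\n9,10,11,12,333\n13,14,15,16,444\n"
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(csv_path, "w") as f:
    f.write(csv_text)

first_input_cols = ["a", "b"]
second_input_cols = ["c", "d"]


def split_and_combine_batch_samples(samples, targets):
    # Stack the selected columns into float tensors of shape (batch, n_columns).
    inp1 = tf.stack([tf.cast(samples[k], tf.float32) for k in first_input_cols], axis=-1)
    inp2 = tf.stack([tf.cast(samples[k], tf.float32) for k in second_input_cols], axis=-1)
    return {"input1": inp1, "input2": inp2}, tf.cast(targets, tf.float32)


# shuffle=False keeps the reading order fixed, so the split below is stable.
dataset = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=2, label_name="e", num_epochs=1, shuffle=False
).map(split_and_combine_batch_samples)

# Shuffle the batches once, in the same order on every pass over the data.
dataset = dataset.shuffle(buffer_size=4, seed=0, reshuffle_each_iteration=False)

# take/skip operate on batches: 1 batch for training, the rest for validation.
train_dataset = dataset.take(1)
val_dataset = dataset.skip(1)

# Input layer names must match the dictionary keys produced by the map function.
input1 = tf.keras.Input(shape=(2,), name="input1")
input2 = tf.keras.Input(shape=(2,), name="input2")
merged = tf.keras.layers.Concatenate()([input1, input2])
output = tf.keras.layers.Dense(1)(merged)
model = tf.keras.Model(inputs=[input1, input2], outputs=output)
model.compile(optimizer="adam", loss="mse")

history = model.fit(train_dataset, validation_data=val_dataset, epochs=1, verbose=0)
```

Because each dataset element already carries both inputs and the label, fit needs no separate x1_train, x2_train and y_train arrays; validation_data takes the second dataset directly.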
Source: https://stackoverflow.com/questions/63271897/splitting-tensorflow-dataset-created-with-make-csv-dataset-into-3-parts-x1-trai