Question
I have been trying to get TPUs working for a classification project. The dataset is quite big (~150 GB), so I cannot load it all into memory; I have therefore been using Dask. Dask doesn't integrate with tf.data.Dataset directly, so I have built a loader inspired by Parallelising tf.data.Dataset.from_generator.
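For context, the Dask arrays X and y referenced below could be built from on-disk storage along these lines. This is only a minimal sketch; the HDF5 file and dataset names are hypothetical, not the exact loading code from the question:

import dask.array as da
import h5py

# Open the file lazily; no data is read into memory yet.
f = h5py.File('data.h5', 'r')                      # hypothetical ~150 GB store
X = da.from_array(f['X'], chunks=(128, 1024, 21))  # features
y = da.from_array(f['y'], chunks=(128, 1024))      # per-timestep labels
# Slices such as X[a:b].compute() later materialise only the requested rows.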
The dataset yields data correctly when I replace the .fit call with a manual one-shot-iterator loop:
iterator = ds.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(1):
        val = sess.run(next_element)
        print(val)
The test code:
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

tf.keras.backend.clear_session()

N_chunk_generators = 64
batch_size = 128
chunk_size = 8
sample_length = 1024  # sequence length; used by make_model below

# X and y are the Dask arrays holding the full dataset.
def gen(chunk):
    # Each generator instance yields chunk_size batches of batch_size samples,
    # computed lazily from the Dask arrays.
    for ibatch in range(chunk*chunk_size, (chunk+1)*chunk_size):
        yield (X[ibatch*batch_size:(ibatch+1)*batch_size].compute().astype('float32'),
               np.expand_dims(y[ibatch*batch_size:(ibatch+1)*batch_size].compute().astype('float32'), axis=2))

def dataset_for_n(n):
    return tf.data.Dataset.from_generator(
        gen,
        (tf.float32, tf.float32),
        (tf.TensorShape([None, 1024, 21]), tf.TensorShape([None, 1024, 1])),
        args=[n])

ds = tf.data.Dataset.range(N_chunk_generators).flat_map(dataset_for_n)
ds = ds.prefetch(4 * batch_size).repeat()
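Note that flat_map consumes the per-chunk generators strictly one after another; a parallel variant in the spirit of the linked question would typically use an interleave instead. A sketch for TF 1.x (cycle_length is an arbitrary choice here):

ds = tf.data.Dataset.range(N_chunk_generators).apply(
    tf.data.experimental.parallel_interleave(dataset_for_n, cycle_length=4))
ds = ds.prefetch(4 * batch_size).repeat()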
def make_model():
    input_shape = (sample_length, 21)
    model = Sequential([
        LSTM(100, input_shape=input_shape, return_sequences=True),
        Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=tf.train.RMSPropOptimizer(learning_rate=0.01),
        loss='binary_crossentropy',
        metrics=['acc'])
    return model
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = make_model()

model.summary()
model.fit(ds, epochs=1, steps_per_epoch=1)
But when I use .fit with the TPU, the session is lost:
W0615 08:41:46.915936 139858515244928 tpu_strategy_util.py:56] TPU system %s has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 1024, 100) 48800
_________________________________________________________________
dense (Dense) (None, 1024, 1) 101
=================================================================
Total params: 48,901
Trainable params: 48,901
Non-trainable params: 0
_________________________________________________________________
---------------------------------------------------------------------------
AbortedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:
10 frames
AbortedError: Session 3de99dcb7d452e4f is not found.
During handling of the above exception, another exception occurred:
AbortedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1368 pass
1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
1371
1372 def _extend_graph(self):
AbortedError: Session 3de99dcb7d452e4f is not found.
Answer 1:
I think I have solved the problem: the issue is that the data lives on the local filesystem, which is not supported by the TPU. The error message is very misleading, though.
Moving the data to TFRecords in a GCS bucket solved the problem:
def parse_tf(proto):
    print(proto)
    features = {"X": tf.FixedLenFeature([1024*21], tf.float32, default_value=None),
                "Y": tf.FixedLenFeature([1024], tf.float32, default_value=None),
                "x_shape": tf.FixedLenFeature([2], tf.int64, default_value=None),
                "y_shape": tf.FixedLenFeature([1], tf.int64, default_value=None)}
    parsed_features = tf.parse_single_example(proto, features)
    return (tf.reshape(parsed_features["X"], [1024, 21]),
            tf.reshape(parsed_features["Y"], [1024, 1]))

tfrecords_dataset = tf.data.TFRecordDataset(["gs://BUCKETNAME/test2.tfrecords"])
ds = tfrecords_dataset.map(parse_tf).batch(64)
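With the data now in GCS, the dataset can be passed to model.fit exactly as before. The record file must actually live in the bucket; assuming an authenticated Colab runtime, it can be uploaded with gsutil (the bucket name is a placeholder):

!gsutil cp test2.tfrecords gs://BUCKETNAME/test2.tfrecords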
Please see this excellent gist for how to generate the TFRecords from a numpy array: https://gist.github.com/jekoehler/4e8a32187ce233f23d452cb4ee1ab5c8
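In case the gist becomes unavailable, here is a minimal sketch of the writer side. X_np and y_np are hypothetical in-memory numpy arrays of shapes (N, 1024, 21) and (N, 1024), matching the features parsed above:

import numpy as np
import tensorflow as tf

def _float_feature(arr):
    # Flatten to a 1-D float list, as expected by FixedLenFeature above.
    return tf.train.Feature(float_list=tf.train.FloatList(value=arr.ravel()))

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

with tf.python_io.TFRecordWriter('test2.tfrecords') as writer:
    for x_sample, y_sample in zip(X_np, y_np):
        example = tf.train.Example(features=tf.train.Features(feature={
            'X': _float_feature(x_sample.astype(np.float32)),
            'Y': _float_feature(y_sample.astype(np.float32)),
            'x_shape': _int64_feature(x_sample.shape),   # (1024, 21)
            'y_shape': _int64_feature(y_sample.shape),   # (1024,)
        }))
        writer.write(example.SerializeToString())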
Source: https://stackoverflow.com/questions/56608850/session-lost-with-keras-and-tpus-in-google-colab