Session lost with Keras and TPUs in Google Colab


Question


I have been trying to get TPUs working for a classification project. The dataset is quite big (~150 GB), so I cannot load it all into memory; that is why I have been using Dask. Dask doesn't integrate with tf.data.Dataset directly, so I have to create a loader inspired by parallelising tf.data.Dataset.from_generator.
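(For context: in the snippets below, X and y are Dask arrays backed by data on disk. A minimal sketch of how such arrays might be created; the file names, shapes and chunk sizes are assumptions, not part of the original question:)

import numpy as np
import dask.array as da

# Hypothetical on-disk data: X has shape (n_samples, 1024, 21), y has shape (n_samples, 1024).
X = da.from_array(np.load("X.npy", mmap_mode="r"), chunks=(4096, 1024, 21))
y = da.from_array(np.load("y.npy", mmap_mode="r"), chunks=(4096, 1024))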

The dataset generates data correctly when I replace the .fit call with:

iterator = ds.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(1):
        val = sess.run(next_element)
        print(val)

The test code:

import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

tf.keras.backend.clear_session()

N_chunk_generators = 64
batch_size = 128
chunk_size = 8

def gen(chunk):
  # Each generator yields chunk_size batches, computed lazily from the Dask arrays X and y.
  for ibatch in range(chunk * chunk_size, (chunk + 1) * chunk_size):
    yield (X[ibatch * batch_size:(ibatch + 1) * batch_size].compute().astype('float32'),
           np.expand_dims(y[ibatch * batch_size:(ibatch + 1) * batch_size].compute().astype('float32'), axis=2))

def dataset_for_n(n):
  return tf.data.Dataset.from_generator(gen,
                                        (tf.float32, tf.float32),
                                        (tf.TensorShape([None, 1024, 21]), tf.TensorShape([None, 1024, 1])),
                                        args=[n])

ds = tf.data.Dataset.range(N_chunk_generators).flat_map(dataset_for_n)
ds = ds.prefetch(4 * batch_size).repeat()



def make_model():
  input_shape = (sample_length, 21)   # sample_length = 1024, matching the dataset shapes above

  model = Sequential([
      LSTM(100, input_shape=input_shape, return_sequences=True),
      Dense(1, activation='sigmoid')
  ])

  model.compile(
      optimizer=tf.train.RMSPropOptimizer(learning_rate=0.01),
      loss='binary_crossentropy',
      metrics=['acc']
  )

  return model



# Connect to the Colab TPU and create a distribution strategy (TF 1.x contrib API).
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

with strategy.scope():
  model = make_model()
  model.summary()

model.fit(ds, epochs=1, steps_per_epoch=1)

But when I use .fit with TPUs, the session is lost:

W0615 08:41:46.915936 139858515244928 tpu_strategy_util.py:56] TPU system %s has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 1024, 100)         48800     
_________________________________________________________________
dense (Dense)                (None, 1024, 1)           101       
=================================================================
Total params: 48,901
Trainable params: 48,901
Non-trainable params: 0
_________________________________________________________________
---------------------------------------------------------------------------
AbortedError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1355     try:
-> 1356       return fn(*args)
   1357     except errors.OpError as e:

10 frames
AbortedError: Session 3de99dcb7d452e4f is not found.

During handling of the above exception, another exception occurred:

AbortedError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1368           pass
   1369       message = error_interpolation.interpolate(message, self._graph)
-> 1370       raise type(e)(node_def, op, message)
   1371 
   1372   def _extend_graph(self):

AbortedError: Session 3de99dcb7d452e4f is not found.

Answer 1:


I think I have solved the problem. The issue is that the data file lives on the local filesystem, which is not supported by the TPU; the error message is very misleading, though.

Switching to TFRecords stored in a GCS bucket solved the problem:

def parse_tf(proto):
  print(proto)
  # Each record stores the flattened X and Y arrays plus their original shapes.
  features = {"X": tf.FixedLenFeature([1024 * 21], tf.float32, default_value=None),
              "Y": tf.FixedLenFeature([1024], tf.float32, default_value=None),
              "x_shape": tf.FixedLenFeature([2], tf.int64, default_value=None),
              "y_shape": tf.FixedLenFeature([1], tf.int64, default_value=None)}
  parsed_features = tf.parse_single_example(proto, features)
  return tf.reshape(parsed_features["X"], [1024, 21]), tf.reshape(parsed_features["Y"], [1024, 1])

tfrecords_dataset = tf.data.TFRecordDataset(["gs://BUCKETNAME/test2.tfrecords"])
ds = tfrecords_dataset.map(parse_tf).batch(64)
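With the dataset now reading from the GCS bucket, the same TPU training code from the question works unchanged; a minimal sketch (the epoch and step counts are placeholders):

ds = ds.repeat()   # repeat so steps_per_epoch can exceed one pass over the file

with strategy.scope():
  model = make_model()

model.fit(ds, epochs=1, steps_per_epoch=100)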

Please see this excellent gist for how to generate TFRecords from a numpy array (a minimal sketch of the same approach follows below).

https://gist.github.com/jekoehler/4e8a32187ce233f23d452cb4ee1ab5c8
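For reference, a sketch of what such a writer might look like, using the same feature names that parse_tf expects ("X", "Y", "x_shape", "y_shape"); the array names and the output path are assumptions, and the resulting file still has to be uploaded to the GCS bucket:

import numpy as np
import tensorflow as tf

def _float_feature(values):
  return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def _int64_feature(values):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def serialize_example(x, y):
  # x: (1024, 21) float array, y: (1024,) float array -- shapes taken from parse_tf above.
  feature = {"X": _float_feature(x.reshape(-1).astype(np.float32)),
             "Y": _float_feature(y.reshape(-1).astype(np.float32)),
             "x_shape": _int64_feature(list(x.shape)),
             "y_shape": _int64_feature(list(y.shape))}
  return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

with tf.python_io.TFRecordWriter("test2.tfrecords") as writer:
  for x_sample, y_sample in zip(X_all, Y_all):   # X_all, Y_all are hypothetical numpy arrays
    writer.write(serialize_example(x_sample, y_sample))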



Source: https://stackoverflow.com/questions/56608850/session-lost-with-keras-and-tpus-in-google-colab
