Session lost with Keras and TPUs in Google Colab


Question


I have been trying to get TPUs working for a classification project. The dataset is quite big (~150 GB), so I cannot load it all into memory; that is why I have been using Dask. Dask doesn't integrate with tf.data.Dataset directly, so I have to create a loader inspired by parallelising tf.data.Dataset.from_generator.
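(For context: in the snippets below, X and y are Dask arrays backed by data on disk. A minimal sketch of how such arrays might be created; the file names, shapes and chunk sizes are assumptions, not part of the original question:)

import numpy as np
import dask.array as da

# Hypothetical on-disk data: X has shape (n_samples, 1024, 21), y has shape (n_samples, 1024).
X = da.from_array(np.load("X.npy", mmap_mode="r"), chunks=(4096, 1024, 21))
y = da.from_array(np.load("y.npy", mmap_mode="r"), chunks=(4096, 1024))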

The dataset generates data correctly when I replace the .fit call with:

iterator = ds.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(1):
        val = sess.run(next_element)
        print(val)

The test code:

import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

tf.keras.backend.clear_session()

N_chunk_generators = 64
batch_size = 128
chunk_size = 8

def gen(chunk):
  # Each generator yields chunk_size batches, computed lazily from the Dask arrays X and y.
  for ibatch in range(chunk * chunk_size, (chunk + 1) * chunk_size):
    yield (X[ibatch * batch_size:(ibatch + 1) * batch_size].compute().astype('float32'),
           np.expand_dims(y[ibatch * batch_size:(ibatch + 1) * batch_size].compute().astype('float32'), axis=2))

def dataset_for_n(n):
  return tf.data.Dataset.from_generator(gen,
                                        (tf.float32, tf.float32),
                                        (tf.TensorShape([None, 1024, 21]), tf.TensorShape([None, 1024, 1])),
                                        args=[n])

ds = tf.data.Dataset.range(N_chunk_generators).flat_map(dataset_for_n)
ds = ds.prefetch(4 * batch_size).repeat()



def make_model():
  input_shape = (sample_length, 21)   # sample_length = 1024, matching the dataset shapes above

  model = Sequential([
      LSTM(100, input_shape=input_shape, return_sequences=True),
      Dense(1, activation='sigmoid')
  ])

  model.compile(
      optimizer=tf.train.RMSPropOptimizer(learning_rate=0.01),
      loss='binary_crossentropy',
      metrics=['acc']
  )

  return model



# Connect to the Colab TPU and create a distribution strategy (TF 1.x contrib API).
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

with strategy.scope():
  model = make_model()
  model.summary()

model.fit(ds, epochs=1, steps_per_epoch=1)

But when I use .fit with TPUs, the session is lost:

W0615 08:41:46.915936 139858515244928 tpu_strategy_util.py:56] TPU system %s has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 1024, 100)         48800     
_________________________________________________________________
dense (Dense)                (None, 1024, 1)           101       
=================================================================
Total params: 48,901
Trainable params: 48,901
Non-trainable params: 0
_________________________________________________________________
---------------------------------------------------------------------------
AbortedError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1355     try:
-> 1356       return fn(*args)
   1357     except errors.OpError as e:

10 frames
AbortedError: Session 3de99dcb7d452e4f is not found.

During handling of the above exception, another exception occurred:

AbortedError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1368           pass
   1369       message = error_interpolation.interpolate(message, self._graph)
-> 1370       raise type(e)(node_def, op, message)
   1371 
   1372   def _extend_graph(self):

AbortedError: Session 3de99dcb7d452e4f is not found.

Answer 1:


I think I have solved the problem. The issue is that the data file lives on the local filesystem, which is not supported by the TPU; the error message is very misleading, though.

Switching to TFRecords stored in a GCS bucket solved the problem:

def parse_tf(proto):
  print(proto)
  # Each record stores the flattened X and Y arrays plus their original shapes.
  features = {"X": tf.FixedLenFeature([1024 * 21], tf.float32, default_value=None),
              "Y": tf.FixedLenFeature([1024], tf.float32, default_value=None),
              "x_shape": tf.FixedLenFeature([2], tf.int64, default_value=None),
              "y_shape": tf.FixedLenFeature([1], tf.int64, default_value=None)}
  parsed_features = tf.parse_single_example(proto, features)
  return tf.reshape(parsed_features["X"], [1024, 21]), tf.reshape(parsed_features["Y"], [1024, 1])

tfrecords_dataset = tf.data.TFRecordDataset(["gs://BUCKETNAME/test2.tfrecords"])
ds = tfrecords_dataset.map(parse_tf).batch(64)
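With the dataset now reading from the GCS bucket, the same TPU training code from the question works unchanged; a minimal sketch (the epoch and step counts are placeholders):

ds = ds.repeat()   # repeat so steps_per_epoch can exceed one pass over the file

with strategy.scope():
  model = make_model()

model.fit(ds, epochs=1, steps_per_epoch=100)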

Please see this excellent gist for how to generate TFRecords from a numpy array (a minimal sketch of the same approach follows below).

https://gist.github.com/jekoehler/4e8a32187ce233f23d452cb4ee1ab5c8
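For reference, a sketch of what such a writer might look like, using the same feature names that parse_tf expects ("X", "Y", "x_shape", "y_shape"); the array names and the output path are assumptions, and the resulting file still has to be uploaded to the GCS bucket:

import numpy as np
import tensorflow as tf

def _float_feature(values):
  return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def _int64_feature(values):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def serialize_example(x, y):
  # x: (1024, 21) float array, y: (1024,) float array -- shapes taken from parse_tf above.
  feature = {"X": _float_feature(x.reshape(-1).astype(np.float32)),
             "Y": _float_feature(y.reshape(-1).astype(np.float32)),
             "x_shape": _int64_feature(list(x.shape)),
             "y_shape": _int64_feature(list(y.shape))}
  return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

with tf.python_io.TFRecordWriter("test2.tfrecords") as writer:
  for x_sample, y_sample in zip(X_all, Y_all):   # X_all, Y_all are hypothetical numpy arrays
    writer.write(serialize_example(x_sample, y_sample))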



Source: https://stackoverflow.com/questions/56608850/session-lost-with-keras-and-tpus-in-google-colab
