Google Colab: Why is CPU faster than TPU?

问题

I'm using Google colab TPU to train a simple Keras model. Removing the distributed strategy and running the same program on the CPU is much faster than TPU. How is that possible?

import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Load Iris dataset
x = load_iris().data
y = load_iris().target

# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)

# Convert train data type to use TPU 
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')

# Specify a distributed strategy to use TPU
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

# Use the strategy to create and compile a Keras model
with strategy.scope():
  model = Sequential()
  model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
  model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
  model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')

start = timeit.default_timer()

# Fit the Keras model on the dataset
model.fit(x_train, y_train, batch_size=20, epochs=20, validation_data=[x_val, y_val], verbose=0, steps_per_epoch=2)

print('\nTime: ', timeit.default_timer() - start)

回答1:

Thank you for your question.

I think what's happening here is a matter of overhead -- since the TPU runs on a separate VM (accessible at grpc://$COLAB_TPU_ADDR), each call to run a model on the TPU incurs some amount of overhead as the client (the Colab notebook in this case) sends a graph to the TPU, which is then compiled and run. This overhead is small compared to the time it takes to run e.g. ResNet50 for one epoch, but large compared to run a simple model like the one in your example.

For best results on TPU we recommend using tf.data.Dataset. I updated your example for TensorFlow 2.2:

%tensorflow_version 2.x
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# Load Iris dataset
x = load_iris().data
y = load_iris().target

# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)

# Convert train data type to use TPU 
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)

# Use the strategy to create and compile a Keras model
with strategy.scope():
  model = Sequential()
  model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
  model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
  model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')

start = timeit.default_timer()

# Fit the Keras model on the dataset
model.fit(train_dataset, epochs=20, validation_data=val_dataset)

print('\nTime: ', timeit.default_timer() - start)

This takes about 30 seconds to run, compared to ~1.3 seconds to run on CPU. We can substantially reduce the overhead here by repeating the dataset and running one long epoch rather than several small ones. I replaced the dataset setup with this:

train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat(20).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)

And replaced the fit call with this:

model.fit(train_dataset, validation_data=val_dataset)

This brings the runtime down to about 6 seconds for me. This is still slower than CPU, but that's not surprising for such a small model that can easily be run locally. In general, you'll see more benefit from using TPUs with larger models. I recommend looking through TensorFlow's official TPU guide, which presents a larger image classification model for the MNIST dataset.

回答2:

This is probably due to the batch size you are using. In comparison to CPU and GPU, the training speed of a TPU is highly dependent on the batch size. Check the following site for more information: https://cloud.google.com/tpu/docs/performance-guide

The Cloud TPU hardware is different from CPUs and GPUs. At a high level, CPUs can be characterized as having a low number of high performing threads. GPUs can be characterized as having a very high number of low performing threads. A Cloud TPU, with its 128 x 128 matrix unit, can be thought of as either a single, very powerful thread, which can perform 16K ops per cycle, or 128 x 128 tiny, simple threads that are connected in pipeline fashion. Correspondingly, when addressing memory, multiples of 8 (floats) are desirable, as well as multiples of 128 for operations targeting the matrix unit.

This means that the batch size should be a multiple of 128, depending on the number of TPUs. Google Colab provides 8 TPUs to you, so in the best case you should select a batch size of 128 * 8 = 1024.

来源：https://stackoverflow.com/questions/59264851/google-colab-why-is-cpu-faster-than-tpu

标签

tensorflow

keras

deep-learning

google-colaboratory

google-cloud-tpu