Running multiple TensorFlow sessions concurrently

攒了一身酷 2020-12-31 05:35

I am trying to run several sessions of TensorFlow concurrently on a CentOS 7 machine with 64 CPUs. My colleague reports that he can use the following two blocks of code to p…

3 Answers
  • 2020-12-31 06:23

    From comment by OP (user1936768):

    I have good news: it turns out that, on my system at least, my trial programs didn't execute long enough for the other instances of TF to start up. When I put a longer-running example program in main, I do indeed see concurrent computations.
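
    For reference, here is a minimal sketch of one way to check this, using Python's multiprocessing to launch several independent sessions. The matrix-multiply workload, worker count, and the run_session name are made up for illustration; they are not the OP's code:

    import multiprocessing

    import tensorflow as tf


    def run_session(worker_id):
        # Build the graph and session inside the child process so that
        # each worker gets its own independent TensorFlow runtime.
        with tf.Graph().as_default():
            a = tf.random_normal([2000, 2000])
            b = tf.random_normal([2000, 2000])
            c = tf.reduce_sum(tf.matmul(a, b))
            with tf.Session() as sess:
                # Run long enough that the workers visibly overlap.
                for _ in range(50):
                    sess.run(c)
        print('worker', worker_id, 'done')


    if __name__ == '__main__':
        procs = [multiprocessing.Process(target=run_session, args=(i,))
                 for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()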

  • 2020-12-31 06:23

    One possibility is that your sessions are each trying to use all 64 cores and stomping on each other. Perhaps try setting NUM_CORES to a lower value for each session:

    # Note: the ConfigProto must be passed via the `config` keyword;
    # the first positional argument of tf.Session is `target`.
    sess = tf.Session(
        config=tf.ConfigProto(inter_op_parallelism_threads=NUM_CORES,
                              intra_op_parallelism_threads=NUM_CORES))
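
    As a rough starting point (NUM_SESSIONS below is a made-up name for however many sessions you plan to run at once), you could split the machine's cores evenly across the sessions:

    import multiprocessing

    NUM_SESSIONS = 4  # hypothetical: the number of concurrent sessions
    NUM_CORES = max(1, multiprocessing.cpu_count() // NUM_SESSIONS)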
    
  • 2020-12-31 06:30

    This can be done elegantly with Ray, a library for parallel and distributed Python that lets you train your models in parallel from a single Python script.

    This has the advantage of letting you parallelize "classes" by turning them into "actors", which can be hard to do with regular Python multiprocessing. This matters because the expensive part is often initializing the TensorFlow graph: if you create an actor once and then call its train method multiple times, the cost of initializing the graph is amortized.

    import numpy as np
    from tensorflow.examples.tutorials.mnist import input_data
    import ray
    import tensorflow as tf
    
    
    @ray.remote
    class TrainingActor(object):
        def __init__(self, seed):
            print('Set new seed:', seed)
            np.random.seed(seed)
            tf.set_random_seed(seed)
            self.mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
    
            # Setting up the softmax architecture.
            self.x = tf.placeholder('float', [None, 784])
            W = tf.Variable(tf.zeros([784, 10]))
            b = tf.Variable(tf.zeros([10]))
            self.y = tf.nn.softmax(tf.matmul(self.x, W) + b)
    
            # Setting up the cost function.
            self.y_ = tf.placeholder('float', [None, 10])
            cross_entropy = -tf.reduce_sum(self.y_*tf.log(self.y))
            self.train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
    
            # Initialization
            self.init = tf.initialize_all_variables()
            self.sess = tf.Session(
                config=tf.ConfigProto(
                    inter_op_parallelism_threads=1,
                    intra_op_parallelism_threads=1
                )
            )
    
        def train(self):
            self.sess.run(self.init)
    
            for i in range(1000):
                batch_xs, batch_ys = self.mnist.train.next_batch(100)
                self.sess.run(self.train_step, feed_dict={self.x: batch_xs, self.y_: batch_ys})
    
            correct_prediction = tf.equal(tf.argmax(self.y, 1), tf.argmax(self.y_, 1))
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))
    
            return self.sess.run(accuracy, feed_dict={self.x: self.mnist.test.images,
                                                      self.y_: self.mnist.test.labels})
    
    
    if __name__ == '__main__':
        # Start Ray.
        ray.init()
    
        # Create 3 actors.
        training_actors = [TrainingActor.remote(seed) for seed in range(3)]
    
        # Make them all train in parallel.
        accuracy_ids = [actor.train.remote() for actor in training_actors]
        print(ray.get(accuracy_ids))
    
        # Start new training runs in parallel.
        accuracy_ids = [actor.train.remote() for actor in training_actors]
        print(ray.get(accuracy_ids))
    

    If you only want to create one copy of the dataset instead of having each actor read it, you can rewrite things as follows. Under the hood, this uses the Plasma shared-memory object store and the Apache Arrow data format.

    @ray.remote
    class TrainingActor(object):
        def __init__(self, mnist, seed):
            # Ray resolves the object ID passed from the driver, so
            # `mnist` arrives here as the actual dataset object.
            self.mnist = mnist
            ...
    
        ...
    
    if __name__ == "__main__":
        ray.init()
    
        # Read the mnist dataset and put it into shared memory once
        # so that workers don't create their own copies.
        mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
        mnist_id = ray.put(mnist)
    
        training_actors = [TrainingActor.remote(mnist_id, seed) for seed in range(3)]
    

    You can see more in the Ray documentation. Note that I'm one of the Ray developers.
