Parallelism isn't reducing the time in dataset map

北恋 2020-12-03 11:14

TF Map function supports parallel calls. I'm seeing no improvements passing num_parallel_calls to map. With num_parallel_calls=1 and num_parallel_calls=10, the run time is the same.

3 Answers
  • 2020-12-03 11:36

    The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at once. As a result, all but one of these multiple parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).

    Your code example didn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++, and can execute in parallel. For example—and just guessing by the name—you could replace it with an invocation of tf.square(x), and you might then enjoy some parallel speedup.

    Note however that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map() is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing some image distortions as part of a data augmentation pipeline.
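
    For illustration, a pure-TensorFlow version of the map (assuming the function really does just square its input) might look something like this:

    import tensorflow as tf

    # tf.square is a native TensorFlow op implemented in C++, so the map
    # threads are not serialized by the Python GIL.
    dataset = tf.data.Dataset.range(1000).map(
        tf.square, num_parallel_calls=10)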

  • 2020-12-03 11:36

    The reason may be that squarer costs less time than the overhead. I modified the code so that the squarer function takes about 2 seconds, and then the num_parallel_calls parameter works as expected. Here is the complete code:

    import tensorflow as tf
    import time
    def squarer(x):
      t0 = time.time()
      while time.time() - t0 < 2:
        y = x ** 2
      return y
    
    def test_two_custom_function_parallelism(num_parallel_calls=1,
                                             batch=False,
                                             batch_size=1,
                                             repeat=1,
                                             num_iterations=10):
      tf.reset_default_graph()
      start = time.time()
      dataset_x = tf.data.Dataset.range(1000).map(
          lambda x: tf.py_func(squarer, [x], [tf.int64]),
          num_parallel_calls=num_parallel_calls).repeat(repeat)
      # dataset_x = dataset_x.prefetch(4)
      if batch:
        dataset_x = dataset_x.batch(batch_size)
      dataset_y = tf.data.Dataset.range(1000).map(
          lambda x: tf.py_func(squarer, [x], [tf.int64]),
          num_parallel_calls=num_parallel_calls).repeat(repeat)
      # dataset_y = dataset_y.prefetch(4)
      if batch:
        dataset_y = dataset_y.batch(batch_size)
      # Build one-shot iterators for both pipelines.
      X = dataset_x.make_one_shot_iterator().get_next()
      Y = dataset_y.make_one_shot_iterator().get_next()
    
      with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
          t0 = time.time()
          try:
            res = sess.run([X, Y])
            print(res)
            i += 1
            if i == num_iterations:
              break
          except tf.errors.OutOfRangeError as e:
            print(i)
            break
          print('step elapse: %.4f' % (time.time() - t0))
      print('total time: %.4f' % (time.time() - start))
    
    
    test_two_custom_function_parallelism(
        num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
    test_two_custom_function_parallelism(
        num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)
    

    The output is:

    [(array([0, 1]),), (array([0, 1]),)]
    step elapse: 4.0204
    [(array([4, 9]),), (array([4, 9]),)]
    step elapse: 4.0836
    [(array([16, 25]),), (array([16, 25]),)]
    step elapse: 4.1529
    [(array([36, 49]),), (array([36, 49]),)]
    total time: 16.3374
    [(array([0, 1]),), (array([0, 1]),)]
    step elapse: 2.2139
    [(array([4, 9]),), (array([4, 9]),)]
    step elapse: 0.0585
    [(array([16, 25]),), (array([16, 25]),)]
    step elapse: 0.0469
    [(array([36, 49]),), (array([36, 49]),)]
    total time: 2.5317
    

    So I am confused about the effect of the "Global Interpreter Lock" mentioned by @mrry.

  • 2020-12-03 11:41

    I set up my own version of map to get something similar to TensorFlow's Dataset.map, but one that uses multiple CPUs for py_functions.

    Usage

    Instead of

    mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)
    

    with the code below, you can get a CPU-parallel py_function version using

    mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)
    

    (The output type(s) for the py_function can also be specified if it's not a single tf.float32)

    Internally, this creates a pool of multiprocessing workers. It still uses a single regular, GIL-limited TensorFlow map, but only to pass the input to a worker and get the output back. The workers process the data in parallel on the CPU.

    Caveats

    The function passed needs to be picklable to work with the multiprocessing pool. This should work for most cases, but some closures or whatnot may fail. Packages like dill might loosen this restriction, but I haven't looked into that.

    If you pass an object's method as the function, you also need to be careful about how the object is duplicated across processes (each process will have its own copy of the object, so you can't rely on the attributes being shared).

    As long as these considerations are kept in mind, this code should work for many cases.
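
    As a rough illustration of the picklability caveat (using only the standard pickle module; the function names here are just placeholders):

    import pickle

    def module_level_square(x):
        return x ** 2

    pickle.dumps(module_level_square)  # fine: referenced by its qualified name
    pickle.dumps(lambda x: x ** 2)     # raises PicklingError: lambdas can't be pickled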

    Code

    """
    Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
    """
    import multiprocessing
    from typing import Callable, Union, List
    import signal
    import tensorflow as tf
    
    
    class PyMapper:
        """
        A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU.
        """
        def __init__(self, map_function: Callable, number_of_parallel_calls: int):
            self.map_function = map_function
            self.number_of_parallel_calls = number_of_parallel_calls
            self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)
    
        @staticmethod
        def pool_worker_initializer():
            """
            Used to initialize each worker process.
            """
            # Corrects bug where worker instances catch and throw away keyboard interrupts.
            signal.signal(signal.SIGINT, signal.SIG_IGN)
    
        def send_to_map_pool(self, element_tensor):
            """
            Sends the tensor element to the pool for processing.
    
            :param element_tensor: The element to be processed by the pool.
            :return: The output of the map function on the element.
            """
            result = self.pool.apply_async(self.map_function, (element_tensor,))
            mapped_element = result.get()
            return mapped_element
    
        def map_to_dataset(self, dataset: tf.data.Dataset,
                           output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
            """
            Maps the map function to the passed dataset.
    
            :param dataset: The dataset to apply the map function to.
            :param output_types: The TensorFlow output types of the function to convert to.
            :return: The mapped dataset.
            """
            def map_py_function(*args):
                """A py_function wrapper for the map function."""
                return tf.py_function(self.send_to_map_pool, args, output_types)
            return dataset.map(map_py_function, self.number_of_parallel_calls)
    
    
    def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
                                   output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
                                   ) -> tf.data.Dataset:
        """
        A one line wrapper to allow mapping a parallel py function to a dataset.
    
        :param dataset: The dataset whose elements the mapping function will be applied to.
        :param map_function: The function to map to the dataset.
        :param number_of_parallel_calls: The number of parallel calls of the mapping function.
        :param output_types: The TensorFlow output types of the function to convert to.
        :return: The mapped dataset.
        """
        py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
        mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
        return mapped_dataset
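
    A rough usage sketch, under the assumption that the mapped function is module-level and picklable (the squared function and the range dataset are just placeholders):

    import tensorflow as tf

    def squared(x):
        # Placeholder CPU-bound function; it runs inside a pool worker process.
        return float(x) ** 2

    dataset = tf.data.Dataset.range(10)
    mapped_dataset = map_py_function_to_dataset(dataset, squared, number_of_parallel_calls=4)
    for element in mapped_dataset:
        print(element.numpy())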
    