Parallelism in Python

星月不相逢 2021-01-07 17:19

What are the options for achieving parallelism in Python? I want to perform a bunch of CPU-bound calculations over some very large rasters, and would like to parallelise them.

5 Answers
  • 2021-01-07 18:02

    The new (2.6) multiprocessing module is the way to go. It uses subprocesses, which gets around the GIL problem. It also abstracts away some of the local/remote issues, so the choice of running your code locally or spread out over a cluster can be made later. Its documentation is a fair bit to chew on, but should provide a good basis to get started.
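
    As a minimal sketch of that approach (the tile-processing function here is a hypothetical stand-in for your raster calculation, not part of the module itself):

    import multiprocessing

    def process_tile(tile):
        # Hypothetical CPU-bound work on one raster tile; substitute your own calculation.
        return sum(x * x for x in tile)

    if __name__ == "__main__":
        tiles = [range(1000000) for _ in range(8)]  # dummy workload standing in for raster tiles
        # Pool spawns worker subprocesses, so each one has its own interpreter and GIL.
        with multiprocessing.Pool(processes=4) as pool:
            results = pool.map(process_tile, tiles)
        print(results)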

  • 2021-01-07 18:07

    Generally, you describe a CPU-bound calculation. This is not Python's forte. Neither, historically, is multiprocessing.

    Threading in the mainstream Python interpreter has long been ruled by a dreaded global interpreter lock (GIL). The new multiprocessing API works around that and provides a worker-pool abstraction, along with pipes, queues, and the like.

    You can write your performance-critical code in C or Cython and use Python for the glue.
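
    As a rough sketch of the worker/queue pattern mentioned above (the process and queue names here are illustrative, not from this answer):

    import multiprocessing

    def worker(task_queue, result_queue):
        # Each worker pulls chunks until it sees the None sentinel.
        for chunk in iter(task_queue.get, None):
            result_queue.put(sum(x * x for x in chunk))  # stand-in for CPU-bound work

    if __name__ == "__main__":
        tasks = multiprocessing.Queue()
        results = multiprocessing.Queue()
        workers = [multiprocessing.Process(target=worker, args=(tasks, results))
                   for _ in range(4)]
        for w in workers:
            w.start()

        chunks = [range(100000) for _ in range(8)]
        for chunk in chunks:
            tasks.put(chunk)
        for _ in workers:
            tasks.put(None)  # one sentinel per worker

        collected = [results.get() for _ in range(len(chunks))]
        for w in workers:
            w.join()
        print(collected)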

  • 2021-01-07 18:09

    Depending on how much data you need to process and how many CPUs/machines you intend to use, it is in some cases better to write part of it in C (or in Java/C# if you want to use Jython/IronPython).

    The speedup you can get from that might do more for your performance than running things in parallel on 8 CPUs.

  • 2021-01-07 18:14

    Ray is an elegant (and fast) library for doing this.

    The most basic strategy for parallelizing Python functions is to declare a function with the @ray.remote decorator. Then it can be invoked asynchronously.

    import ray
    import time
    
    # Start the Ray processes (e.g., a scheduler and shared-memory object store).
    ray.init(num_cpus=8)
    
    @ray.remote
    def f():
        time.sleep(1)
    
    # This should take one second assuming you have at least 4 cores.
    ray.get([f.remote() for _ in range(4)])
    

    You can also parallelize stateful computation using actors, again by using the @ray.remote decorator.

    # This assumes you already ran 'import ray' and 'ray.init()'.
    
    import time
    
    @ray.remote
    class Counter(object):
        def __init__(self):
            self.x = 0
    
        def inc(self):
            self.x += 1
    
        def get_counter(self):
            return self.x
    
    # Create two actors which will operate in parallel.
    counter1 = Counter.remote()
    counter2 = Counter.remote()
    
    @ray.remote
    def update_counters(counter1, counter2):
        for _ in range(1000):
            time.sleep(0.25)
            counter1.inc.remote()
            counter2.inc.remote()
    
    # Start three tasks that update the counters; they run in the background, in parallel.
    update_counters.remote(counter1, counter2)
    update_counters.remote(counter1, counter2)
    update_counters.remote(counter1, counter2)
    
    # Check the counter values.
    for _ in range(5):
        counter1_val = ray.get(counter1.get_counter.remote())
        counter2_val = ray.get(counter2.get_counter.remote())
        print("Counter1: {}, Counter2: {}".format(counter1_val, counter2_val))
        time.sleep(1)
    

    It has a number of advantages over the multiprocessing module:

    • The same code runs on a single multi-core machine as well as a large cluster.
    • Data is shared efficiently between processes on the same machine using shared memory and efficient serialization.
    • You can parallelize Python functions (using tasks) and Python classes (using actors).
    • Error messages are propagated nicely.

    Ray is a framework I've been helping develop.
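
    As a small illustration of the shared-memory point in the list above (my own sketch, assuming NumPy is installed; the function and array are hypothetical), a large array can be placed in the object store once and read by every task without copying:

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    def column_sums(raster):
        # 'raster' arrives as a read-only view backed by the shared-memory object store.
        return raster.sum(axis=0)

    big = np.ones((2000, 2000))   # hypothetical stand-in for a large raster
    ref = ray.put(big)            # stored once, shared by all workers on this machine
    futures = [column_sums.remote(ref) for _ in range(4)]
    print(ray.get(futures)[0][:5])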

  • 2021-01-07 18:17

    There are many packages for doing this; the most appropriate, as others have said, is multiprocessing, especially its Pool class.

    A similar result can be obtained with Parallel Python, which in addition is designed to work with clusters.

    Anyway, I would say go with multiprocessing.
