Parallelism in Python

后端未结

关注

 5  974

What are the options for achieving parallelism in Python? I want to perform a bunch of CPU bound calculations over some very large rasters, and would like to parallelise th

相关标签:

5条回答

天命终不由人

2021-01-07 18:02

The new (2.6) multiprocessing module is the way to go. It uses subprocesses, which gets around the GIL problem. It also abstracts away some of the local/remote issues, so the choice of running your code locally or spread out over a cluster can be made later. The documentation I've linked above is a fair bit to chew on, but should provide a good basis to get started.

0 讨论(0)
发布评论:

提交评论
- 加载中...
有刺的猬

2021-01-07 18:07

Generally, you describe a CPU bound calculation. This is not Python's forte. Neither, historically, is multiprocessing.

Threading in the mainstream Python interpreter has been ruled by a dreaded global lock. The new multiprocessing API works around that and gives a worker pool abstraction with pipes and queues and such.

You can write your performance critical code in C or Cython, and use Python for the glue.

0 讨论(0)
发布评论:

提交评论
- 加载中...
既然无缘

2021-01-07 18:09

Depending on how much data you need to process and how many CPUs/machines you intend to use, it is in some cases better to write a part of it in C (or Java/C# if you want to use jython/IronPython)

The speedup you can get from that might do more for your performance than running things in parallel on 8 CPUs.

0 讨论(0)
发布评论:

提交评论
- 加载中...

一个人的身影

2021-01-07 18:14

Ray is an elegant (and fast) library for doing this.

The most basic strategy for parallelizing Python functions is to declare a function with the @ray.remote decorator. Then it can be invoked asynchronously.

import ray
import time

# Start the Ray processes (e.g., a scheduler and shared-memory object store).
ray.init(num_cpus=8)

@ray.remote
def f():
    time.sleep(1)

# This should take one second assuming you have at least 4 cores.
ray.get([f.remote() for _ in range(4)])

You can also parallelize stateful computation using actors, again by using the @ray.remote decorator.

# This assumes you already ran 'import ray' and 'ray.init()'.

import time

@ray.remote
class Counter(object):
    def __init__(self):
        self.x = 0

    def inc(self):
        self.x += 1

    def get_counter(self):
        return self.x

# Create two actors which will operate in parallel.
counter1 = Counter.remote()
counter2 = Counter.remote()

@ray.remote
def update_counters(counter1, counter2):
    for _ in range(1000):
        time.sleep(0.25)
        counter1.inc.remote()
        counter2.inc.remote()

# Start three tasks that update the counters in the background also in parallel.
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)

# Check the counter values.
for _ in range(5):
    counter1_val = ray.get(counter1.get_counter.remote())
    counter2_val = ray.get(counter2.get_counter.remote())
    print("Counter1: {}, Counter2: {}".format(counter1_val, counter2_val))
    time.sleep(1)

It has a number of advantages over the multiprocessing module:

The same code runs on a single multi-core machine as well as a large cluster.
Data is shared efficiently between processes on the same machine using shared memory and efficient serialization.
You can parallelize Python functions (using tasks) and Python classes (using actors).
Error messages are propagated nicely.

Ray is a framework I've been helping develop.

0 讨论(0)

执念已碎

2021-01-07 18:17

There are many packages to do that, the most appropriate as other said is multiprocessing, expecially with the class "Pool".

A similar result can be obtained by parallel python, that in addition is designed to work with clusters.

Anyway, I would say go with multiprocessing.

0 讨论(0)
发布评论:

提交评论
- 加载中...