Question
I am trying to make use of multiprocessing across several different computers, which pathos seems geared towards: "Pathos is a framework for heterogenous computing. It primarily provides the communication mechanisms for configuring and launching parallel computations across heterogenous resources." Looking at the documentation, however, I am at a loss as to how to get a cluster up and running. I am looking to:
1. Set up a remote server or set of remote servers with secure authentication.
2. Securely connect to the remote server(s).
3. Map a task across all CPUs in both the remote servers and my local machine using a straightforward API like pool.map in the standard multiprocessing package (like the pseudocode in this related question).
I do not see an example for (1), and I do not understand the tunnel example provided for (2). The example does not actually connect to an existing service on the localhost. I would also like to know if and how I can require this communication to come with a password or key of some kind that would prevent someone else from connecting to the server. I understand this uses SSH authentication, but absent a preexisting key, that only ensures that the traffic cannot be read as it passes over the Internet; it does nothing to prevent someone else from hijacking the server.
Answer 1:
I'm the pathos author. Basically, for (1) you can use pathos.pp to connect to another computer through a socket connection. pathos.pp has almost exactly the same API as pathos.multiprocessing, although with pathos.pp you can give the address and port of a remote host to connect to, using the keyword servers when setting up the Pool.
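As a minimal sketch of the servers keyword described above: the host names, the port, and the helper names server_addresses and remote_map are hypothetical, and a pp worker is assumed to be listening on each remote host already. The pool class is the pathos.pp.ParallelPythonPool named later in this answer.

```python
def server_addresses(hosts, port):
    """Format hosts as the 'host:port' strings passed to the servers keyword."""
    return tuple("{}:{}".format(host, port) for host in hosts)


def remote_map(func, data, servers, local_workers=2):
    """Map func over data using local workers plus the given remote servers.

    Assumes pathos is installed and a pp worker is already running on
    every address in servers.
    """
    # Imported lazily so server_addresses works without pathos installed.
    from pathos.pp import ParallelPythonPool
    pool = ParallelPythonPool(local_workers, servers=servers)
    return pool.map(func, data)


# Example call (hypothetical hosts and port, not run here):
#   servers = server_addresses(["node1.example.com", "node2.example.com"], 5653)
#   squares = remote_map(lambda x: x ** 2, range(10), servers)
```

Note that this plain socket connection is not encrypted or authenticated by SSH.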
However, if you want to make a secure connection with SSH, it's best to establish an SSH tunnel (as in the example you linked to), and then pass localhost and the local port number to the servers keyword in Pool. The pool will then connect to the remote pp-worker through the SSH tunnel. See:
https://github.com/uqfoundation/pathos/blob/master/examples/test_ppmap2.py and
http://www.cacr.caltech.edu/~mmckerns/pathos.html
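A rough sketch of the tunnel approach, under a few assumptions: the helper names tunneled_server and map_through_tunnel are hypothetical, a pp worker is assumed to be listening on remote_port of the remote host, and the local end of the tunnel is read from the tunnel's _lport attribute, which is a private attribute and may change between pathos releases.

```python
def tunneled_server(local_port):
    """Build the servers tuple that points a pool at the local tunnel end."""
    return ("localhost:{}".format(local_port),)


def map_through_tunnel(func, data, remote_host, remote_port, local_workers=1):
    """Open an SSH tunnel to remote_host, then map func over data through it.

    Assumes pathos is installed, SSH login to remote_host works without a
    prompt, and a pp worker is already listening on remote_port.
    """
    # Imported lazily so tunneled_server works without pathos installed.
    from pathos.core import connect
    from pathos.pp import ParallelPythonPool

    tunnel = connect(remote_host, port=remote_port)
    try:
        # _lport is the locally forwarded port (private attribute, an assumption).
        servers = tunneled_server(tunnel._lport)
        pool = ParallelPythonPool(local_workers, servers=servers)
        return pool.map(func, data)
    finally:
        tunnel.disconnect()


# Example call (hypothetical host and port, not run here):
#   map_through_tunnel(lambda x: x ** 2, range(10), "remote.example.com", 1234)
```

The pool only ever sees localhost and the forwarded port, so the traffic to the remote worker travels inside the SSH connection.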
Lastly, if you are using pathos.pp with a remote server, as above, you should already be doing (3). However, for a sufficiently embarrassingly parallel set of jobs, it can be more efficient to nest the parallel maps: first use pathos.pp.ParallelPythonPool to build a parallel map across servers, then call an N-way job using a parallel map in pathos.multiprocessing.ProcessingPool inside the function you are mapping with pathos.pp. This will minimize the communication across the remote connection.
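The nesting described above might look roughly like the following sketch. The helper names chunked, process_chunk, and nested_map are hypothetical, as is the choice of one chunk per remote server plus one for the local machine; the squaring function stands in for the real per-item work.

```python
def chunked(items, n_chunks):
    """Split items into n_chunks contiguous chunks of near-equal size."""
    items = list(items)
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Runs once per server: fan the chunk out over that machine's CPUs."""
    # Imports and helpers live inside the function because it is shipped
    # to the remote worker and executed there.
    from pathos.multiprocessing import ProcessingPool

    def square(x):
        return x ** 2

    inner = ProcessingPool()  # defaults to the number of CPUs on that machine
    return inner.map(square, chunk)


def nested_map(data, servers):
    """Outer map across servers, inner map across each server's CPUs."""
    from pathos.pp import ParallelPythonPool

    # One local worker plus the remote servers (an assumption of this sketch).
    outer = ParallelPythonPool(1, servers=servers)
    parts = outer.map(process_chunk, chunked(data, len(servers) + 1))
    return [y for part in parts for y in part]
```

Only one chunk per machine crosses the remote connection, rather than one message per item.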
Also, you don't need to give an SSH password if you have ssh-agent working for you. See: http://mah.everybody.org/docs/ssh. For parallel maps across remote servers, pathos assumes that ssh-agent is working and that you won't need to type your password every time a connection is made.
EDIT: added example code on your question here: Python Multiprocessing with Distributed Cluster
Source: https://stackoverflow.com/questions/26939704/python-multiprocessing-with-distributed-cluster-using-pathos