Question
I have Python code that uses a Java library by means of JPype. Currently, each run of my function checks whether the JVM exists, and creates it if that is not the case:

import jpype as jp

def myfunc(i):
    if not jp.isJVMStarted():
        jp.startJVM(jp.getDefaultJVMPath(), '-ea',
                    '-Djava.class.path=' + jar_location)
    do_something_hard(i)
Further, I want to parallelize my code using the Python multiprocessing library. Each worker (supposedly) runs independently, calculating the value of my function with different parameters. For example:

import numpy as np
import pathos

pool = pathos.multiprocessing.ProcessingPool(8)
params = np.arange(100)
result = pool.map(myfunc, params)
This construction works fine, except that it has dramatic memory leaks when using more than one core in the pool. I notice that all memory is freed up when Python is closed, but memory still accumulates over time while pool.map is running, which is undesirable. The JPype documentation is incredibly brief, suggesting to synchronize threads by wrapping Python threads with jp.attachThreadToJVM and jp.detachThreadFromJVM. However, I cannot find a single example online of how to actually do it. I have tried wrapping the call to do_something_hard inside myfunc with these statements, but it had no effect on the leak. I have also attempted to explicitly shut down the JVM at the end of myfunc using jp.shutdownJVM. However, in this case the JVM seems to crash as soon as I use more than one core, leading me to believe that there is a race condition.
Please help:
- What is going on? Why would there be a race condition? Is it not the case that each worker gets its own JVM?
- What is the correct way to free up memory in my scenario?
Answer 1:
The problem is with the nature of multiprocessing. Python can either fork or spawn a new process. The fork option appears to have significant problems with the JVM; fork is the default on Linux.
Using the spawn context (multiprocessing.get_context("spawn")) to create a spawned copy of Python allows a fresh JVM to be created in each process. Each spawned copy is completely independent. There are examples in subrun.py in the test directory on GitHub, as that is what is used to test different JVM options for JPype.
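A minimal sketch of the spawn approach, assuming your own jar_location and workload (the JPype calls are left as comments since they require a JVM and jar; squaring the input stands in for do_something_hard):

```python
import multiprocessing as mp

def square(i):
    # In a real run, each spawned worker would start its own fresh JVM here:
    # import jpype as jp
    # if not jp.isJVMStarted():
    #     jp.startJVM(jp.getDefaultJVMPath(), '-ea',
    #                 '-Djava.class.path=' + jar_location)
    return i * i  # stand-in for do_something_hard(i)

def run_spawned(params, workers=4):
    # Force the spawn start method even on Linux, where fork is the default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(workers) as pool:
        return pool.map(square, params)

if __name__ == "__main__":
    # The __main__ guard is required with spawn: children re-import this module.
    print(run_spawned(range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because spawn starts each worker from a clean interpreter, the `isJVMStarted()` check in the worker is always false the first time, so every process gets its own independent JVM rather than a broken forked copy.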
The fork version creates a copy of the original process, including the previously running JVM. At least from my testing, the forked JVM does not work as expected. Older versions of JPype (0.6.x) would allow the forked copy to call startJVM, which would create a big memory leak. The current version, 0.7.1, raises an exception saying that the JVM cannot be restarted.
If you are using threads (rather than processes), all threads share the same JVM and do not need to start it independently. There is further documentation on the use of multiprocessing with JPype in the latest documentation on GitHub, under the "limitations" section.
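The threaded variant mentioned above can be sketched like this (again the JPype calls are left as comments, and squaring stands in for do_something_hard): the JVM is process-wide, so it is started once up front and every thread then shares it.

```python
import threading

# With threads, start the single shared JVM once in the main thread:
# import jpype as jp
# jp.startJVM(jp.getDefaultJVMPath(), '-ea', '-Djava.class.path=' + jar_location)

results = {}
results_lock = threading.Lock()

def worker(i):
    value = i * i  # stand-in for do_something_hard(i), which would call into Java
    with results_lock:       # protect the shared dict from concurrent writes
        results[i] = value

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.items()))
```

Note that because of the GIL this only helps if the heavy work happens on the Java side (JPype releases the GIL while Java code runs); pure-Python workloads will not speed up with threads.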
Source: https://stackoverflow.com/questions/58695140/memory-leaks-in-jpype-with-multiprocessing