I am trying to load numpy files asynchronously in a Pool:
self.pool = Pool(2, maxtasksperchild = 1)
...
nextPackage = self.pool.apply_async(loadPackages, (..
It think I've found a workaround by retrieving data in small chunks. In my case it was a list of lists.
I had:
for i in range(0, NUMBER_OF_THREADS):
print('MAIN: Getting data from process ' + str(i) + ' proxy...')
X_train.extend(ListasX[i]._getvalue())
Y_train.extend(ListasY[i]._getvalue())
ListasX[i] = None
ListasY[i] = None
gc.collect()
Changed to:
CHUNK_SIZE = 1024
for i in range(0, NUMBER_OF_THREADS):
print('MAIN: Getting data from process ' + str(i) + ' proxy...')
for k in range(0, len(ListasX[i]), CHUNK_SIZE):
X_train.extend(ListasX[i][k:k+CHUNK_SIZE])
Y_train.extend(ListasY[i][k:k+CHUNK_SIZE])
ListasX[i] = None
ListasY[i] = None
gc.collect()
And now it seems to work, possibly by serializing less data at a time. So maybe if you can segment your data into smaller portions you can overcome the issue. Good luck!
There is a bug in Python C core code that prevents data responses bigger than 2GB return correctly to the main thread. you need to either split the data into smaller chunks as suggested in the previous answer or not use multiprocessing for this function
I reported this bug to python bugs list (https://bugs.python.org/issue34563) and created a PR (https://github.com/python/cpython/pull/9027) to fix it, but it probably will take a while to get it released (UPDATE: the fix is present in python 3.8.0+)
if you are interested you can find more details on what causes the bug in the bug description in the link I posted