Question
I have a gzip file handle that I'm writing to from a multiprocessing pool. Unfortunately, the output file seems to become corrupted after a certain point, so doing something like zcat out | wc
gives:
gzip: out: invalid compressed data--format violated
I'm dealing with the problem by not using gzip. But I'm curious as to why this is happening and if there is any solution.
Not sure if it matters, but I'm running the code on a remote Linux machine that I don't control; my guess is that it's an Ubuntu machine. Python 2.7.3.
And here's the slightly simplified code:
import gzip
from multiprocessing import Lock, Pool

lock = Lock()
ohandle = gzip.open("out", "w")

def process(fn):
    rv = []
    for l in open(fn):
        sometext = dosomething(l)
        rv.append(sometext)
    lock.acquire()
    for sometext in rv:
        print >> ohandle, sometext
    lock.release()

pool = Pool(processes=4)
pm = pool.map(process, some_file_list)
ohandle.close()
Answer 1:
See http://docs.python.org/2/library/multiprocessing.html#programming-guidelines
- You should guard the launching code with `if __name__ == '__main__':`, or that part will also be run by each child process.
- Explicitly pass shared resources (ohandle, lock) to child processes.
In your code, each forked worker inherits its own copy of the GzipFile object, and each copy keeps independent compression state and buffers. Even with the lock, the workers' separately compressed output interleaves in the file, which is why zcat reports a format violation.
I modified your code to not use a lock and not share ohandle. Instead, each worker writes its output to a temporary file (fn + '.temp'), and the parent concatenates the temporary files into the gzip file afterwards.
Caution: check your filenames first. If any file with a '.temp' suffix already exists, my code could overwrite or delete your data.
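If you did want to keep a shared lock rather than drop it, the usual way to "explicitly pass resources" to pool workers is the Pool's initializer/initargs mechanism. A minimal sketch of that pattern (the doubling `process()` function here is hypothetical, and this is written for modern Python, though the idea is the same on 2.7):

```python
from multiprocessing import Pool, Lock

def init_worker(shared_lock):
    # Stash the lock in a module-level global so process()
    # can see it inside each worker process.
    global lock
    lock = shared_lock

def process(item):
    with lock:          # serialize the critical section
        return item * 2

def run(items):
    l = Lock()
    pool = Pool(processes=2, initializer=init_worker, initargs=(l,))
    try:
        return pool.map(process, items)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(run([1, 2, 3]))
```

Passing the lock through initargs works even with start methods where workers don't inherit globals from the parent, which is why it is preferred over relying on fork-time inheritance.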
import gzip
import os
from multiprocessing import Pool

def process(fn):
    out_fn = fn + '.temp'
    with open(fn) as f, open(out_fn, 'w') as f2:
        for l in f:
            sometext = dosomething(l)
            print >> f2, sometext
    return out_fn

if __name__ == '__main__':
    some_file_list = ...
    pool = Pool(processes=4)
    ohandle = gzip.open('out.gz', 'w')
    for fn in pool.map(process, some_file_list):
        with open(fn) as f:
            while True:
                data = f.read(1 << 12)
                if not data:
                    break
                ohandle.write(data)
        os.unlink(fn)
    ohandle.close()
    pool.close()
    pool.join()
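The chunked read/write loop at the end can also be written with shutil.copyfileobj from the standard library, which does the same buffered copy. A sketch of just the merge step, factored into a helper (the function name and demo file names are my own, not from the answer):

```python
import gzip
import os
import shutil

def merge_into_gzip(temp_files, out_path):
    # Concatenate plain-text temp files into one gzip archive,
    # deleting each temp file after it has been copied.
    with gzip.open(out_path, 'wb') as out:
        for fn in temp_files:
            with open(fn, 'rb') as f:
                shutil.copyfileobj(f, out)  # chunked copy, like the manual loop
            os.unlink(fn)

if __name__ == '__main__':
    # tiny demo with two throwaway temp files
    for i, text in enumerate([b'hello\n', b'world\n']):
        with open('part%d.temp' % i, 'wb') as f:
            f.write(text)
    merge_into_gzip(['part0.temp', 'part1.temp'], 'out.gz')
```

Closing the gzip handle (here via the `with` block) matters: the gzip trailer is only written on close, so forgetting it is another way to end up with a file zcat rejects.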
Source: https://stackoverflow.com/questions/17016029/gzip-issue-with-multiprocessing-pool