How to multiprocess or multithread a big file by dividing it into small chunks based on the values of a particular column?

Anonymous (unverified), submitted 2019-12-03 02:38:01

Question:

I have written a Python program for a biological process: https://codereview.stackexchange.com/questions/186396/solve-the-phase-state-between-two-haplotype-blocks-using-markov-transition-proba

If you look into that program you can see that it spends a lot of time computing data from two consecutive lines (or keys, vals) at a time. I am not putting the whole code here, but for simplicity I am creating a mock file and a mock program (given below) that behave similarly at the simplest level. In this mock program I am computing, say, len(vals) for each line and writing it back to an output file.

Since the computation is CPU/GPU bound while iterating over (k1, v1) and (k2, v2) .... in the original program (link above), I want to multiprocess/multithread the data analysis by: 1) reading the whole data into memory in the fastest possible way, 2) dividing the data into chunks by unique chr field, 3) doing the computation, and 4) writing it back to a file. So, how would I do it?

In the given mock file, the computation is too simple to be CPU/GPU bound, but I just want to know how I could do it if need be.
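To make step 2 concrete, the kind of grouping I have in mind is roughly the following (just an illustration; group_by_chr is a made-up helper, not part of my real program):

from collections import defaultdict

def group_by_chr(rows):
    # rows are already-split lines, e.g. ['2', '23', '4', 'abcd']
    chunks = defaultdict(list)
    for row in rows:
        chunks[row[0]].append(row)    # key on the chr column
    return chunks                     # e.g. {'2': [...], '3': [...], '4': [...]}

Each such chunk would then go to its own process/thread, and the results would be written back in order.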

Note: I have had many people asking what I am trying to achieve - I am trying to multiprocess/multithread the given problem. If I put my whole big original program here, nobody is going to look at it. So, let's work through this small file and the small Python program.

Below is my code and data:

my_data = '''chr\tpos\tidx\tvals
2\t23\t4\tabcd
2\t25\t7\tatg
2\t29\t8\tct
2\t35\t1\txylfz
3\t37\t2\tmnost
3\t39\t3\tpqr
3\t41\t6\trtuv
3\t45\t5\tlfghef
3\t39\t3\tpqr
3\t41\t6\trtu
3\t45\t5\tlfggg
4\t25\t3\tpqrp
4\t32\t6\trtu
4\t38\t5\tlfgh
4\t51\t3\tpqr
4\t57\t6\trtus
'''


def manipulate_lines(vals):
    vals_len = len(vals[3])
    return write_to_file(vals[0:3], vals_len)

def write_to_file(a, b):
    print(a, b)
    to_file = open('write_multiprocessData.txt', 'a')
    to_file.write('\t'.join(['\t'.join(a), str(b), '\n']))
    to_file.close()

def main():
    to_file = open('write_multiprocessData.txt', 'w')
    to_file.write('\t'.join(['chr', 'pos', 'idx', 'vals', '\n']))
    to_file.close()

    data = my_data.rstrip('\n').split('\n')

    for lines in data:
        if lines.startswith('chr'):
            continue
        else:
            lines = lines.split('\t')
        manipulate_lines(lines)


if __name__ == '__main__':
    main()

Answer 1:

One issue to handle when using multiple processes to process data is preserving order. Python has a rather nice way of handling this: a multiprocessing.Pool, which can be used to map a worker function over the input data and then takes care of returning the results in order.

However, the processing itself may still happen out of order, so to use this properly only the computation, and no IO, should run in the subprocesses. Therefore, to use it in your case, your code needs a small rewrite so that all IO operations happen in the main process:

from multiprocessing import Pool
from time import sleep
from random import randint

my_data = '''chr\tpos\tidx\tvals
2\t23\t4\tabcd
2\t25\t7\tatg
2\t29\t8\tct
2\t35\t1\txylfz
3\t37\t2\tmnost
3\t39\t3\tpqr
3\t41\t6\trtuv
3\t45\t5\tlfghef
3\t39\t3\tpqr
3\t41\t6\trtu
3\t45\t5\tlfggg
4\t25\t3\tpqrp
4\t32\t6\trtu
4\t38\t5\tlfgh
4\t51\t3\tpqr
4\t57\t6\trtus
'''

def manipulate_lines(vals):
    sleep(randint(0, 2))
    vals_len = len(vals[3])
    return vals[0:3], vals_len

def write_to_file(a, b):
    print(a, b)
    to_file = open('write_multiprocessData.txt', 'a')
    to_file.write('\t'.join(['\t'.join(a), str(b), '\n']))
    to_file.close()

def line_generator(data):
    for line in data:
        if line.startswith('chr'):
            continue
        else:
            yield line.split('\t')

def main():
    p = Pool(5)

    to_file = open('write_multiprocessData.txt', 'w')
    to_file.write('\t'.join(['chr', 'pos', 'idx', 'vals', '\n']))
    to_file.close()

    data = my_data.rstrip('\n').split('\n')

    lines = line_generator(data)
    results = p.map(manipulate_lines, lines)

    for result in results:
        write_to_file(*result)

if __name__ == '__main__':
    main()

This program does not split the list by its different chr values; instead it processes entries one by one, directly from the list, in at most 5 (the argument to Pool) subprocesses.
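If you do want the chunk-by-chr behaviour you described, one way (a sketch only, reusing my_data, Pool and write_to_file from the listing above) is to group the rows with itertools.groupby before mapping, so that each task handed to the pool is one whole chr block:

from itertools import groupby
from operator import itemgetter

def process_chunk(chunk):
    # chunk holds every row with the same chr value; put the heavy per-chunk work here
    return [(row[0:3], len(row[3])) for row in chunk]

def chunked_main():
    rows = [line.split('\t') for line in my_data.rstrip('\n').split('\n')[1:]]  # skip header
    rows.sort(key=itemgetter(0))                       # groupby needs its key pre-sorted
    chunks = [list(g) for _, g in groupby(rows, key=itemgetter(0))]
    with Pool(5) as p:
        for chunk_result in p.map(process_chunk, chunks):   # chunk order is preserved
            for cols, length in chunk_result:
                write_to_file(cols, length)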

To show that the data is still in the expected order, I added a random sleep delay to the manipulate_lines function. This shows the concept but may not give a correct view of the speedup, since a sleeping process allows another one to run in parallel, whereas a compute-heavy process will use the CPU for all of its run time.

As can be seen, the writing to file has to be done once the map call returns, which assures that all subprocesses have terminated and returned their results. There is quite some overhead for this communication behind the scenes, so for this to be beneficial the compute part must take substantially longer than the write phase, and it must not generate too much data to write to file.
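One knob worth knowing here is the chunksize argument of Pool.map: it sends items to the workers in batches, which reduces the per-item communication overhead at the cost of coarser load balancing. For example, the map call above could become (the value 4 is arbitrary):

    # batch items into groups of 4 per task to cut down on per-item communication overhead
    results = p.map(manipulate_lines, lines, chunksize=4)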

In addition, I have broken the for loop out into a generator, so that the input to the multiprocessing.Pool is available on request. Another way would be to pre-process the data list and then pass that list directly to the Pool. I find the generator solution nicer, though, with smaller peak memory consumption.
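One caveat: Pool.map pulls the whole iterable into a list before dispatching, so the generator mostly helps readability here. If peak memory is a real concern, Pool.imap is the lazier alternative; it consumes the generator on demand and yields results one at a time, still in input order. The write loop would then look roughly like this:

    for cols, length in p.imap(manipulate_lines, line_generator(data)):
        write_to_file(cols, length)   # results are yielded in input order, one at a time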

Also, a comment on multithreading vs multiprocessing: as long as you do compute-heavy operations, you should use multiprocessing, which, at least in theory, allows the processes to run on different CPU cores. In addition, in CPython - the most used Python implementation - threads hit another issue, which is the global interpreter lock (GIL). This means that only one thread can execute Python code at a time, since the interpreter blocks access for all other threads. (There are some exceptions, e.g. when using modules written in C, like numpy. In these cases the GIL can be released while doing numpy calculations, but in general this is not the case.) Thus, threads are mainly for situations where your program is stuck waiting for slow, out-of-order IO (sockets, terminal input, etc.).
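If, on the other hand, the bottleneck were IO rather than computation, a thread pool would be the natural tool; concurrent.futures gives it the same map-style interface. A minimal sketch (fetch_line is just a stand-in for whatever blocking IO you would do per item):

from concurrent.futures import ThreadPoolExecutor

def fetch_line(line):
    # stand-in for a blocking IO call (network request, disk read, ...) per line
    return line.split('\t')

def threaded_io(data):
    with ThreadPoolExecutor(max_workers=5) as pool:
        # like Pool.map, executor.map returns results in input order
        return list(pool.map(fetch_line, data))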



Answer 2:

I've only used threading a few times and I have not tested the code below, but from a quick glance, the for loop is really the only place that could benefit from threading.

I'll let other people decide though.

import threading

my_data = '''chr\tpos\tidx\tvals
2\t23\t4\tabcd
2\t25\t7\tatg
2\t29\t8\tct
2\t35\t1\txylfz
3\t37\t2\tmnost
3\t39\t3\tpqr
3\t41\t6\trtuv
3\t45\t5\tlfghef
3\t39\t3\tpqr
3\t41\t6\trtu
3\t45\t5\tlfggg
4\t25\t3\tpqrp
4\t32\t6\trtu
4\t38\t5\tlfgh
4\t51\t3\tpqr
4\t57\t6\trtus
'''


def manipulate_lines(vals):
    vals_len = len(vals[3])
    return write_to_file(vals[0:3], vals_len)

def write_to_file(a, b):
    print(a, b)
    to_file = open('write_multiprocessData.txt', 'a')
    to_file.write('\t'.join(['\t'.join(a), str(b), '\n']))
    to_file.close()

def main():
    to_file = open('write_multiprocessData.txt', 'w')
    to_file.write('\t'.join(['chr', 'pos', 'idx', 'vals', '\n']))
    to_file.close()

    data = my_data.rstrip('\n').split('\n')

    for lines in data:
        if lines.startswith('chr'):   # skip the header row
            continue
        lines = lines.split('\t')
        # args must be a tuple, hence the trailing comma
        threading.Thread(target=manipulate_lines, args=(lines,)).start()


if __name__ == '__main__':
    main()
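If this approach is used, it would probably be worth keeping the Thread objects around and joining them, so that main() does not return while workers are still writing; a variation of the loop (untested, in the same spirit as the code above):

    threads = []
    for lines in data:
        if lines.startswith('chr'):   # skip the header row
            continue
        t = threading.Thread(target=manipulate_lines, args=(lines.split('\t'),))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()                      # wait for every worker before main() returns

Note that all the threads still append to the same output file, so the writes can interleave; a lock around write_to_file, or funnelling results back to the main thread, would avoid that.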

