How can I parallelize a pipeline of generators/iterators in Python?

囚心锁ツ 2021-02-20 02:17

Suppose I have some Python code like the following:

input = open("input.txt")
x = (process_line(line) for line in input)
y = (process_item(item) for item in x)


        
2 Answers
  • 2021-02-20 02:37

    You can't really parallelize reading from or writing to files; these will be your bottleneck, ultimately. Are you sure your bottleneck here is CPU, and not I/O?
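
    One rough way to answer that question is to time each stage on its own. The following is a minimal sketch; `time_stage` is a hypothetical helper written for this illustration, not part of any library.

```python
import time


def time_stage(iterable):
    """Consume an iterable and return (item_count, seconds_elapsed)."""
    start = time.perf_counter()
    count = sum(1 for _ in iterable)  # pull every item through the stage
    return count, time.perf_counter() - start
```

    Timing `open("input.txt")` alone, and then `(process_line(line) for line in open("input.txt"))`, gives an approximate split between read time and processing time. If reading dominates, a pool is unlikely to help much. The second read may be served from the OS file cache, so treat the numbers as indicative only.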

    Since your processing contains no dependencies (according to you), it's trivially simple to use Python's multiprocessing.Pool class.

    There are a couple of ways to write this, but the easiest one to debug is to find the independent critical path (the slowest part of the code) and run only that part in parallel. Let's presume it's process_item.

    …And that's it, actually. Code:

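    import multiprocessing

    # The guard is needed on platforms that spawn (rather than fork) worker processes.
    if __name__ == "__main__":
        with multiprocessing.Pool() as p:  # defaults to one worker per available CPU
            with open("input.txt") as input, open("output.txt", "w") as output:
                x = (process_line(line) for line in input)
                y = p.imap(process_item, x)  # runs in worker processes, keeps input order
                z = (generate_output_line(item) + "\n" for item in y)
                output.writelines(z)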
    

    I haven't tested it, but this is the basic idea. Pool's imap method makes sure results are returned in the right order.
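
    The ordering claim can be checked with a small sketch. Everything here is an assumption made for illustration: `process_item` is a toy stand-in that makes earlier items slower, and `run_ordered` is a hypothetical helper. `ThreadPool` is imported because it exposes the same `imap` interface as `Pool`, which makes it easy to try the idea without starting processes.

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool  # same imap interface, backed by threads


def process_item(item):
    # Toy stand-in for the real work: earlier items sleep longer,
    # so later items tend to finish first.
    time.sleep(0.01 * (5 - item % 5))
    return item * item


def run_ordered(items, pool_factory=multiprocessing.Pool, workers=2):
    """Map process_item over items with imap and return the results as a list."""
    with pool_factory(workers) as pool:
        # imap yields results in input order, regardless of completion order.
        return list(pool.imap(process_item, items))
```

    With the default `pool_factory`, call `run_ordered` from under an `if __name__ == "__main__":` guard, and keep `process_item` at module level so it can be pickled. Passing `pool_factory=ThreadPool` runs the same code on threads. `imap` also accepts a `chunksize` argument, which may reduce overhead when there are many small items.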

  • 2021-02-20 02:45

    "is there any easy way to make it so that multiple lines can be in the pipeline at once"

    I wrote a library to do just this, https://github.com/michalc/threaded-buffered-pipeline, which iterates over each iterable in a separate thread.

    So what was

    input = open("input.txt")
    
    x = (process_line(line) for line in input)
    y = (process_item(item) for item in x)
    z = (generate_output_line(item) + "\n" for item in y)
    
    output = open("output.txt", "w")
    output.writelines(z)
    

    becomes

    from threaded_buffered_pipeline import buffered_pipeline
    
    input = open("input.txt")
    
    buffer_iterable = buffered_pipeline()
    x = buffer_iterable((process_line(line) for line in input))
    y = buffer_iterable((process_item(item) for item in x))
    z = buffer_iterable((generate_output_line(item) + "\n" for item in y))
    
    output = open("output.txt", "w")
    output.writelines(z)
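
    To make the idea of "each iterable in its own thread" concrete, here is a minimal standard-library sketch. It is not the library's implementation; `buffered` and `buffer_size` are hypothetical names chosen for this illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the upstream iterable


def buffered(iterable, buffer_size=4):
    """Iterate `iterable` in a background thread, passing items through a bounded queue."""
    q = queue.Queue(maxsize=buffer_size)

    def worker():
        try:
            for item in iterable:
                q.put((True, item))  # blocks while the buffer is full
        except Exception as exc:
            q.put((False, exc))  # forward upstream errors to the consumer
        else:
            q.put((True, _DONE))

    # The thread starts when the consumer asks for the first item.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        ok, value = q.get()
        if not ok:
            raise value
        if value is _DONE:
            return
        yield value
```

    Chaining works like the example above: `y = buffered(process_item(item) for item in buffered(x))`. The bounded queue lets each stage run a few items ahead of the next one. This sketch does not handle a consumer that stops early; in that case the worker thread stays blocked on `put` until the program exits.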
    

    How much actual parallelism this adds depends on what each iterable actually does, how many CPU cores you have, and how busy they are.

    The classic limitation is the Python GIL: if each step is CPU-heavy and pure Python, then little parallelism is gained, and this might not be faster than the serial version. On the other hand, if each step is network-I/O heavy, then I think it's likely to be faster.
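
    The I/O-bound case can be illustrated with a small timing sketch. The names are hypothetical: `fake_network_call` uses `time.sleep` as a stand-in for a blocking network request (sleeping releases the GIL, as blocking I/O does), and `timed_map` is a helper written for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_network_call(item, delay=0.2):
    # Stand-in for an I/O-bound step; the thread waits without holding the GIL.
    time.sleep(delay)
    return item


def timed_map(func, items, workers):
    """Run func over items on a thread pool; return (results, seconds_elapsed)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(func, items))  # map preserves input order
    return results, time.perf_counter() - start
```

    Calling `timed_map(fake_network_call, range(4), workers=4)` should take roughly one delay rather than four, because the waits overlap. Swapping in a CPU-heavy pure-Python function would not be expected to show a similar gain on a CPython build with the GIL.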
