python -> multiprocessing module

2021-02-04 18:07

Here's what I am trying to accomplish -

  1. I have about a million files which I need to parse & append the parsed content to a single file.
  2. Since a sin…
2 Answers
  • 2021-02-04 18:49

    Although the discussion with Eric was fruitful, later on I found a better way of doing this. The multiprocessing module has a Pool class which is perfect for my needs.

    It sizes itself to the number of cores my system has, i.e. only as many processes are spawned as there are cores. Of course this is customizable. So here's the code. Might help someone later -

    import glob
    import os
    from multiprocessing import Pool
    
    DATA_DIR = 'data'                     # directory holding the input files
    file_ptr = open('combined.txt', 'a')  # the single output file
    
    def main():
        po = Pool()   # defaults to os.cpu_count() workers; Pool(n) to customize
        for filepath in glob.glob(os.path.join(DATA_DIR, '*.csv')):
            po.apply_async(mine_page, (filepath,), callback=save_data)
        po.close()
        po.join()
        file_ptr.close()
    
    def mine_page(filepath):
        # do whatever it is that you want to do, in a separate process
        with open(filepath) as f:
            data = f.read()               # placeholder for the real parsing
        return data
    
    def save_data(data):
        # data is an object; store it in a file, MySQL or ...
        file_ptr.write(data)
    
    if __name__ == '__main__':
        main()

    Still going through this huge module. I'm not sure whether save_data() is executed by the parent process or by the spawned child processes. If it's the child that does the saving, it might lead to concurrency issues in some situations. If anyone has more experience using this module, more knowledge here would be appreciated...
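
    As far as I can tell from the docs, the callback is applied in the parent process (by the pool's result-handler thread), so the children never touch the output file and callbacks run one at a time. A quick sketch to check this by comparing PIDs; the names work/done are made up for the demo:

    import os
    from multiprocessing import Pool
    
    def work(x):
        return x, os.getpid()          # PID of the worker that ran the task
    
    def done(result):
        # runs in the parent's result-handler thread, not in a child
        print('worker pid:', result[1], '| callback pid:', os.getpid())
    
    if __name__ == '__main__':
        po = Pool(2)
        for i in range(4):
            po.apply_async(work, (i,), callback=done)
        po.close()
        po.join()

    Every line printed shows the same callback PID (the parent's), while the worker PIDs vary.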

  • 2021-02-04 18:57

    The docs for multiprocessing indicate several methods of sharing state between processes:

    http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
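
    For example, one documented option is a shared Value, which brings its own lock; a minimal sketch (the counter here is just an illustration, not tied to the question's files):

    from multiprocessing import Process, Value
    
    def worker(counter):
        with counter.get_lock():       # Value comes with its own lock
            counter.value += 1
    
    if __name__ == '__main__':
        counter = Value('i', 0)        # shared C int, initialized to 0
        procs = [Process(target=worker, args=(counter,)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(counter.value)           # prints 4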

    Whether each process gets a fresh interpreter depends on the start method: under spawn a new interpreter is started and the target (function) and args are pickled into it, while under fork (the Unix default) the child inherits a copy of the parent's memory. Either way the global namespace from your script is visible to your worker function, so data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
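
    Here is a sketch of the global-file case, with 'out.txt' standing in for the question's data_file:

    import multiprocessing as mp
    
    data_file = open('out.txt', 'a')   # module-level, like data_file in the question
    
    def worker(line):
        data_file.write(line)          # the child finds the global and writes
        data_file.flush()
    
    if __name__ == '__main__':
        procs = [mp.Process(target=worker, args=('line %d\n' % i,)) for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        data_file.close()

    Under fork the child writes through a descriptor inherited from the parent; under spawn the module is re-imported in the child, so data_file there is a second, independent handle to the same file. As for passing the file object as an arg: that only works with plain Process under fork, since Pool and the spawn method pickle their arguments, and ordinary file objects are not picklable.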

    An alternative is to pass another Queue that will hold the results from the workers. The workers put their results on the queue, and the main process gets them and writes them to the file.
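
    A rough sketch of that pattern, assuming one result per file (the file names and the parsing step are placeholders):

    import multiprocessing as mp
    
    def worker(filepath, results):
        data = 'parsed ' + filepath    # placeholder for the real parsing
        results.put(data)
    
    if __name__ == '__main__':
        results = mp.Queue()
        files = ['a.csv', 'b.csv', 'c.csv']    # stand-ins for the million files
        procs = [mp.Process(target=worker, args=(f, results)) for f in files]
        for p in procs:
            p.start()
        with open('combined.txt', 'a') as out:
            for _ in files:
                out.write(results.get() + '\n')   # only the parent writes
        for p in procs:
            p.join()

    Draining the queue before join() matters: a child blocks on exit until its queued data has been flushed, so joining first can deadlock.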
