Memory usage with concurrent.futures.ThreadPoolExecutor in Python3

Question

I am building a script to download and parse benefits information for health insurance plans on Obamacare exchanges. Part of this requires downloading and parsing the plan benefit JSON files from each individual insurance company. To do this, I am using concurrent.futures.ThreadPoolExecutor with 6 workers to download each file (with urllib), parse and loop through the JSON, and extract the relevant info (which is stored in a nested dictionary within the script).

(running Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32)

The problem is that when I do this concurrently, the script does not seem to release the memory after it has downloaded, parsed, and looped through a JSON file. After a while it crashes, with malloc raising a memory error.

When I do it serially, with a simple for ... in loop, the program neither crashes nor uses an extreme amount of memory.

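import concurrent.futures
import json
import urllib.request

def load_json_url(url, timeout):
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    # Pass the timeout through to urlopen; otherwise the parameter is never used.
    resp = urllib.request.urlopen(req, timeout=timeout).read().decode('utf8')
    return json.loads(resp)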



# formulary_urls, drugid, plansid_dict and downloaded_plans are defined elsewhere in the script.
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_json_url, url, 60): url for url in formulary_urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            # The below timeout isn't raising the TimeoutError.
            data = future.result(timeout=0.01)
            for item in data:
                if item['rxnorm_id'] == drugid:
                    for row in item['plans']:
                        print(row['drug_tier'])
                        plan = plansid_dict[row['plan_id']]
                        plan['drug_tier'] = row['drug_tier']
                        plan['prior_authorization'] = row['prior_authorization']
                        plan['step_therapy'] = row['step_therapy']
                        plan['quantity_limit'] = row['quantity_limit']
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            downloaded_plans += 1

Answer 1:

It's not your fault. as_completed() doesn't release its futures until it has finished iterating. There's an issue logged already: https://bugs.python.org/issue27144

For now, I think the most common approach is to wrap as_completed() in an outer loop that splits the work into chunks of a sane number of futures, depending on how much RAM you want to spend and how big your results will be. It blocks on each chunk until every job in it has finished before moving on to the next chunk, so it will be slower and may stall for a long time in the middle, but I see no other way for now. I will update this answer when there's a smarter way.
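
A minimal sketch of the chunking idea, under a few assumptions: chunked, process_in_chunks, handle and chunk_size are hypothetical names introduced here for illustration, and the chunk size of 50 is arbitrary. Only one chunk's worth of futures is referenced at any time.

```python
import concurrent.futures
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(func, items, handle, max_workers=6, chunk_size=50):
    """Run func over items, keeping at most chunk_size futures alive at once.

    handle(item, result) is called in the calling thread for each result.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for chunk in chunked(items, chunk_size):
            futures = {executor.submit(func, item): item for item in chunk}
            for future in concurrent.futures.as_completed(futures):
                handle(futures[future], future.result())
            # Drop this chunk's futures (and their results) before the next chunk.
            futures.clear()
```

With the question's code, func would be something like a one-argument wrapper around load_json_url, and handle would copy the few needed fields into plansid_dict so that the full parsed JSON can be discarded.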

Answer 2:

As an alternative solution, you can call add_done_callback on your futures and not use as_completed at all. The key is NOT keeping references to the futures, so the future_to_url dict in the original question is a bad idea.

What I've done is basically:

def do_stuff(future):
    res = future.result()  # handle exceptions here if you need to

f = executor.submit(...)
f.add_done_callback(do_stuff)
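
A fuller, runnable sketch of the same callback idea, with assumptions stated: run_with_callbacks, make_callback and on_done are hypothetical names, and the whole result is stored only to keep the example short (in the question's case you would store just the extracted fields). Callbacks may run in worker threads, hence the lock.

```python
import concurrent.futures
import threading


def run_with_callbacks(func, items, max_workers=6):
    """Submit func(item) for each item and collect results via done-callbacks."""
    results = {}
    lock = threading.Lock()

    def make_callback(item):
        def on_done(future):
            try:
                value = future.result()  # re-raises any exception from func
            except Exception as exc:
                print('%r generated an exception: %s' % (item, exc))
                return
            with lock:
                results[item] = value
        return on_done

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for item in items:
            # The future is not stored in any list or dict of our own.
            executor.submit(func, item).add_done_callback(make_callback(item))
    # Leaving the with-block waits for all tasks, and so all callbacks, to finish.
    return results
```

For example, run_with_callbacks(lambda n: n * n, range(5)) returns a dict mapping each input to its square.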

Answer 3:

If you use the standard concurrent.futures module and want to process several million items concurrently, the queue of pending tasks will take up all the free memory, because every submitted task is queued immediately.

You can use bounded-pool-executor. https://github.com/mowshon/bounded_pool_executor

pip install bounded-pool-executor

Example:

from bounded_pool_executor import BoundedProcessPoolExecutor
from time import sleep
from random import randint

def do_job(num):
    sleep_sec = randint(1, 10)
    print('value: %d, sleep: %d sec.' % (num, sleep_sec))
    sleep(sleep_sec)

# The guard is needed where worker processes are spawned (e.g. on Windows).
if __name__ == '__main__':
    with BoundedProcessPoolExecutor(max_workers=5) as worker:
        for num in range(10000):
            print('#%d Worker initialization' % num)
            worker.submit(do_job, num)
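
If you would rather not add a dependency, the same bounding idea can be sketched with the standard library alone. This is an illustration, not the code of bounded-pool-executor: BoundedThreadPool and max_pending are hypothetical names, and the sketch assumes a thread pool fits your workload.

```python
import concurrent.futures
import threading


class BoundedThreadPool:
    """Hypothetical wrapper: submit() blocks while too many tasks are pending."""

    def __init__(self, max_workers, max_pending):
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_pending)

    def submit(self, fn, *args, **kwargs):
        # Blocks once max_pending tasks are queued or running.
        self._slots.acquire()
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot as soon as the task finishes, fails, or is cancelled.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self, wait=True):
        self._executor.shutdown(wait=wait)
```

Usage mirrors the example above: create BoundedThreadPool(max_workers=5, max_pending=10), call submit(do_job, num) in a loop, then call shutdown(). The loop simply pauses whenever ten tasks are outstanding instead of queueing millions of them.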

Source: https://stackoverflow.com/questions/37445540/memory-usage-with-concurrent-futures-threadpoolexecutor-in-python3

Tags

python

json

concurrency

malloc

urllib