I am building a script to download and parse benefits information for health insurance plans on Obamacare exchanges. Part of this requires downloading and parsing the plan benefits JSON data.
dodysw has correctly pointed out that the common solution is to chunkify the inputs and submit chunks of tasks to the executor. He has also correctly pointed out that you lose some performance by waiting for each chunk to be processed completely before starting to process the next chunk.
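For comparison, the chunked pattern might look roughly like the sketch below. The chunks() helper is my own illustration rather than code from dodysw's answer, and it reuses the load_json_url worker and formulary_urls list that appear further down.

import concurrent.futures

def chunks(seq, size):
    # Yield successive slices of `size` items from `seq`.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

MAX_WORKERS = 6
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    for chunk in chunks(formulary_urls, MAX_WORKERS):
        # extend() consumes the map() iterator, so each chunk must finish
        # completely before the next one is submitted; that wait is the
        # performance loss mentioned above.
        results.extend(executor.map(load_json_url, chunk))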
I suggest a better solution that will feed a continuous stream of tasks to the executor while enforcing an upper bound on the maximum number of parallel tasks in order to keep the memory footprint low.
The trick is to use concurrent.futures.wait
to keep track of the futures that have been completed and those that are still pending completion:
import concurrent.futures
import json
import urllib.request


def load_json_url(url):
    # Return (parsed_json, None) on success or (url, exception) on failure,
    # so the caller can tell the two cases apart.
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        resp = urllib.request.urlopen(req).read().decode('utf8')
        return json.loads(resp), None
    except Exception as e:
        return url, e


MAX_WORKERS = 6

# formulary_urls, drugid and plansid_dict are assumed to be defined
# elsewhere in the script.
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures_done = set()
    futures_notdone = set()
    for url in formulary_urls:
        futures_notdone.add(executor.submit(load_json_url, url))

        # Never keep more than MAX_WORKERS tasks in flight: once the limit
        # is reached, wait for at least one future to finish before
        # submitting the next one.
        if len(futures_notdone) >= MAX_WORKERS:
            done, futures_notdone = concurrent.futures.wait(
                futures_notdone, return_when=concurrent.futures.FIRST_COMPLETED)
            futures_done.update(done)

    # Collect whatever is still pending after the last submission.
    done, _ = concurrent.futures.wait(futures_notdone)
    futures_done.update(done)

# Process results.
downloaded_plans = 0
for future in futures_done:
    data, exc = future.result()
    if exc:
        print('%r generated an exception: %s' % (data, exc))
    else:
        downloaded_plans += 1
        for item in data:
            if item['rxnorm_id'] == drugid:
                for row in item['plans']:
                    print(row['drug_tier'])
                    plansid_dict[row['plan_id']]['drug_tier'] = row['drug_tier']
                    plansid_dict[row['plan_id']]['prior_authorization'] = row['prior_authorization']
                    plansid_dict[row['plan_id']]['step_therapy'] = row['step_therapy']
                    plansid_dict[row['plan_id']]['quantity_limit'] = row['quantity_limit']
Of course, you could also process the results inside the loop in order to empty futures_done from time to time. For example, you could do that whenever the number of items in futures_done exceeds 1000 (or any other amount that fits your needs). This might come in handy if your dataset is very large and keeping all the results in memory at once would be costly.
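A minimal sketch of that variant follows. It reuses MAX_WORKERS, load_json_url and formulary_urls from above, and process_done() is a hypothetical helper standing in for the result-processing loop shown earlier.

PROCESS_EVERY = 1000  # drain threshold; tune it to your memory budget

def process_done(futures):
    # Hypothetical helper: run the result-processing loop shown above
    # over the given futures so their results can be released.
    ...

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures_done = set()
    futures_notdone = set()
    for url in formulary_urls:
        futures_notdone.add(executor.submit(load_json_url, url))
        if len(futures_notdone) >= MAX_WORKERS:
            done, futures_notdone = concurrent.futures.wait(
                futures_notdone, return_when=concurrent.futures.FIRST_COMPLETED)
            futures_done.update(done)

        # Periodically drain the completed futures so their results do not
        # pile up in memory for the whole run.
        if len(futures_done) >= PROCESS_EVERY:
            process_done(futures_done)
            futures_done.clear()

    # Drain and process whatever finished after the last submission.
    done, _ = concurrent.futures.wait(futures_notdone)
    futures_done.update(done)
    process_done(futures_done)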