Running out of memory on python product iteration chain

Submitted by 妖精的绣舞 on 2021-01-29 19:52:48

Question


I am trying to build a list of all possible string combinations and then iterate against it. I run out of memory executing the line below, which makes sense, because the list comes to several billion strings.

data = list(map(''.join, chain.from_iterable(product(string.digits + string.ascii_lowercase + '/', repeat=i) for i in range(0, 7))))
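For scale: the alphabet is 37 characters (10 digits, 26 lowercase letters, and '/'), so that list would hold the sum of 37^i strings for i from 0 through 6. The count can be checked without building anything:

import string

alphabet = string.digits + string.ascii_lowercase + '/'
print(len(alphabet))                                 # 37
print(sum(len(alphabet) ** i for i in range(0, 7)))  # 2636996587

That is roughly 2.6 billion strings; at several dozen bytes per Python string object, the list alone would need well over 100 GB.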

So I think, rather than creating this massive list up front, I should generate it and execute against it in waves, with some kind of "holding string" that I save to disk and can restart from. That is: generate and iterate against a million rows, save the last string processed to a file, then start the next million rows from that saved string (or the row after it). I have no clue how to do that; I suspect I can't keep the chain.from_iterable(product(...)) construction I've been using. Something like the sketch below is what I have in mind. If that idea is not clear (or is clear but stupid), let me know.
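Here is a minimal sketch of the wave idea as I picture it, assuming a hypothetical checkpoint file (resume.txt) and made-up names (all_strings, batch); the generator never holds more than one string in memory at a time:

import string
from itertools import product, islice

alphabet = string.digits + string.ascii_lowercase + '/'

def all_strings(max_len=6):
    # Lazily yield every combination, shortest first; nothing is stored
    for i in range(max_len + 1):
        for combo in product(alphabet, repeat=i):
            yield ''.join(combo)

# Resume from the saved position, or start from scratch
try:
    with open('resume.txt') as f:
        start = int(f.read())
except FileNotFoundError:
    start = 0

batch = 1_000_000  # one "wave"
for idx, kw in enumerate(islice(all_strings(), start, start + batch), start=start):
    ...  # iterate against kw here (the API call)
    if idx % 1000 == 0:
        with open('resume.txt', 'w') as f:
            f.write(str(idx + 1))  # checkpoint: next row to process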

Also, another option, rather than working around the memory issue, would be to somehow optimize the iterable itself, but I'm not sure how I would do that either. I'm trying to map an API that has no existing documentation. While I don't know that a non-exhaustive list is the route to take, I'm certainly open to suggestions.

Here is the code chunk I've been using:

import csv
import string
from itertools import product, chain

# Open stringfile (a path defined elsewhere). If it doesn't exist, create it.
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
except FileNotFoundError:
    data = list(map(''.join, chain.from_iterable(
        product(string.digits + string.ascii_lowercase + '/', repeat=i)
        for i in range(0, 6))))
    with open(stringfile, 'w') as f:
        f.write('\n'.join(data))

#Iterate against
...

EDIT: Further poking at this led me to this thread, which covers a similar topic. There is discussion there about using islice, which helps me post-mapping (the script crashed last night during the API calls due to an error in my exception handling, and I just restarted it at the 400,000th iterable).

Can I use .islice within the product generation itself? That is, generate only items 10 million through 12 million (for example) and operate on just those, as a way to preserve memory?
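From what I can tell, islice accepts any iterator, including the chained product generator itself, so the window can be cut before anything is ever turned into a list. A sketch with example bounds:

import string
from itertools import product, chain, islice

gen = map(''.join, chain.from_iterable(
    product(string.digits + string.ascii_lowercase + '/', repeat=i)
    for i in range(0, 7)))

# Items 10,000,000 through 11,999,999: earlier items are generated
# and thrown away to reach the start, but nothing is ever stored
for kw in islice(gen, 10_000_000, 12_000_000):
    ...  # operate on kw

The catch is that skipping to the start still takes time proportional to the start index; only the memory stays flat.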

Here is the most recent snippet of what I'm doing. You can see I plugged islice into the actual iteration further down, but I want to islice the actual generation (the data = line).

# Open stringfile. If it doesn't exist, create it.
try:
    with open(stringfile) as f:
        reader = csv.reader(f, delimiter=',')
        data = list(reader)
except FileNotFoundError:
    data = list(map(''.join, chain.from_iterable(
        product(string.digits + string.ascii_lowercase + '/', repeat=i)
        for i in range(3, 5))))
    with open(stringfile, 'w') as f:
        f.write('\n'.join(data))

print("Total items: " + str(len(data)-substart))
fdf = pd.DataFrame()
sdf = pd.DataFrame()
qdf = pd.DataFrame()
attctr = 0
# Iterate through the string combination list
for idx, kw in islice(enumerate(data), substart, substop):
    # Print progress every 1,000 rows
    if idx % 1000 == 0:
        print("Iteration " + str(idx) + " of " + str(len(data)))
    # Attempt the API call; cool down every attcd attempts so the API isn't hammered
    attctr += 1
    if attctr == attcd:
        print("Cooling down!")
        time.sleep(cdtimer)
        attctr = 0
    try:
....
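For reference, if the data = list(...) line were swapped for the bare generator, len(data) would no longer exist, but the total for the progress printout can be computed in closed form. A sketch, with substart and substop defined as above and the same range bounds as the snippet:

import string
from itertools import product, chain, islice

alphabet = string.digits + string.ascii_lowercase + '/'
total = sum(len(alphabet) ** i for i in range(3, 5))  # count without materializing

gen = map(''.join, chain.from_iterable(
    product(alphabet, repeat=i) for i in range(3, 5)))

for idx, kw in enumerate(islice(gen, substart, substop), start=substart):
    if idx % 1000 == 0:
        print("Iteration " + str(idx) + " of " + str(total))
    ...  # API call and cooldown logic as above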

Source: https://stackoverflow.com/questions/65743187/running-out-of-memory-on-python-product-iteration-chain
