I am processing a csv file which is 2.5 GB in size. The 2.5 GB table looks like this:
columns=[ka, kb_1, kb_2, timeofEvent, timeInterval]
Based on your snippet, you can process the file line by line. I assume that kb_2 is the error indicator:
groups = {}

with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        # assuming this structure: ka, kb_1, kb_2, timeofEvent, timeInterval
        arr = line.rstrip("\n").split('\t')
        # group key is the combination of ka and kb_1
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0.0}
        groups[k]['record_count'] += 1
        groups[k]['error_sum'] += float(arr[2])

for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))
This snippet keeps one running count and error sum per group in a dictionary and computes each group's error rate only after the entire file has been read. The file itself is streamed line by line, so its 2.5 GB size is not a problem, but the script will run out of memory if there are too many distinct (ka, kb_1) combinations to hold in the dictionary at once.
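If the number of distinct groups really is too large for RAM, one option is to keep the running totals on disk instead. Below is a minimal sketch using sqlite3 from the standard library; the file name groups.db and the table/column names are my own choices, and it assumes the same tab-separated layout as above.

import sqlite3

# On-disk aggregation: the running counts and sums live in an SQLite file,
# so memory use stays flat no matter how many distinct groups there are.
con = sqlite3.connect("groups.db")  # hypothetical file name
con.execute("CREATE TABLE IF NOT EXISTS agg ("
            "grp TEXT PRIMARY KEY, record_count INTEGER, error_sum REAL)")

with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.rstrip("\n").split('\t')
        k = arr[0] + ',' + arr[1]
        # create the group row if it is new, then add to its running totals
        con.execute("INSERT OR IGNORE INTO agg VALUES (?, 0, 0.0)", (k,))
        con.execute("UPDATE agg SET record_count = record_count + 1, "
                    "error_sum = error_sum + ? WHERE grp = ?",
                    (float(arr[2]), k))

con.commit()

# compute each group's error rate from the on-disk aggregates
for grp, count, err_sum in con.execute(
        "SELECT grp, record_count, error_sum FROM agg"):
    print('{group}: {error_rate}'.format(group=grp, error_rate=err_sum / count))

con.close()

The trade-off is slower per-row processing, but the aggregation no longer depends on the number of groups fitting in memory.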