Out of memory error when reading CSV file in chunks

天涯浪人 2021-01-02 18:54

I am processing a CSV file that is 2.5 GB. The table looks like this:

columns=[ka, kb_1, kb_2, timeofEvent, timeInterval]


        
2 Answers
  •  野趣味 (OP)
     2021-01-02 19:40

    Based on your snippet, you can do the aggregation while reading the file line by line.

    I assume that kb_2 is the error indicator:

    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        for line in large_file:
            # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
            arr = line.split('\t')
            # group key: the (ka, kb_1) pair
            k = arr[0] + ',' + arr[1]
            if k not in groups:
                groups[k] = {'record_count': 0, 'error_sum': 0.0}
            groups[k]['record_count'] += 1
            groups[k]['error_sum'] += float(arr[2])

    for k, v in groups.items():
        print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))
    

    This snippet keeps a running count and error sum for every (ka, kb_1) group in a dictionary, and prints each group's error rate after the entire file has been read.

    It will still run out of memory if there are too many distinct (ka, kb_1) combinations, since every group keeps an entry in the dictionary.
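
    Since the question title mentions reading the CSV in chunks, here is a minimal sketch of the same aggregation with pandas' chunked reader. The file path, separator, header assumption and chunk size are guesses rather than details from the post, and it still holds one row per distinct (ka, kb_1) group in memory, so it shares the same limitation.

    import pandas as pd

    totals = None

    # Stream the 2.5 GB file in chunks instead of loading it all at once.
    # Assumed: tab-separated with a header row naming the columns
    # ka, kb_1, kb_2, timeofEvent, timeInterval; adjust sep/chunksize to your file.
    for chunk in pd.read_csv("data/petaJoined.csv", sep="\t",
                             usecols=["ka", "kb_1", "kb_2"],
                             chunksize=1_000_000):
        # Partial sum and count of the error indicator kb_2 per (ka, kb_1) group.
        part = chunk.groupby(["ka", "kb_1"])["kb_2"].agg(["sum", "count"])
        totals = part if totals is None else totals.add(part, fill_value=0)

    # Combine the partial aggregates into per-group error rates.
    print(totals["sum"] / totals["count"])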
