Calculate running total from csv line by line

时光总嘲笑我的痴心妄想 submitted on 2019-12-24 18:46:36

Question


I'm loading a CSV file line by line because it has ~800 million lines, and there are many such files that I need to analyse. Loading them in parallel is therefore paramount, and reading line by line is required so as not to blow up the memory.

I have been given an answer on how to count the number of entries in which each unique ID appears throughout the dataset using collections.Counter() (see Counting csv column occurrences on the fly in Python).

But is there a way to keep, for each unique ID in one column, a running total of the values in another column as each line is read in?

E.g., suppose that the data in your CSV file had only two columns and therefore looked like the following:

 [1 1]
 [1 1]
 [2 2]
 [3 2]
 [2 2]
 [1 2]

Where the second column contains the unique IDs for which you want to keep a running total of values in the first column. So your output should look like the following:

{'1': 2, '2': 8}

For ID '1' in column two, the total is given by 1+1 in column one. For ID '2' in column two, the total is given by 2+3+2+1.
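To make the target concrete, here is a minimal sketch that reproduces that output on the six sample rows. It assumes the rows are stored as comma-separated text (the bracketed form above is only a display), and `running_totals` is a hypothetical helper name rather than anything from the linked answer. It uses collections.defaultdict, so no membership check is needed per line.

```python
import csv
import io
from collections import defaultdict


def running_totals(lines):
    """Sum column one per ID found in column two.

    `lines` is any iterable of CSV text lines (an open file works too),
    so only one row is held in memory at a time.
    """
    totals = defaultdict(float)  # missing IDs start at 0.0
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        totals[row[1]] += float(row[0])
    return dict(totals)


# The six sample rows, assumed to be comma-separated on disk.
sample = io.StringIO("1,1\n1,1\n2,2\n3,2\n2,2\n1,2\n")
result = running_totals(sample)
print(result)  # {'1': 2.0, '2': 8.0}
```

The totals come out as floats because each value is converted with float(); use int() instead if the column is known to hold integers.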

How can one do this quickly given the vast size of the CSV I'm working with? Here is my current attempt:

import csv

features = {}

# filename is the path to the CSV file (defined elsewhere)
with open(filename, newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        ID = row[1]  # second column holds the ID
        if ID not in features:
            features[ID] = {'Some_feature': 0}
        # add the first column's value to this ID's running total
        features[ID]['Some_feature'] += float(row[0])

But doing so with so many lines takes far too long. The idea is that I would also build a similar dictionary holding the number of occurrences of each ID, so that I can divide the features dictionary by it to get the mean of each feature. (This seems unnecessary, but remember that these files are so large that they are read in line by line.)
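The sum-and-count idea can be sketched in a single pass so that each file is read only once. This is an illustrative sketch under the same assumption of comma-separated rows. `per_id_means` is a hypothetical name, and whether it is fast enough for ~800 million lines has not been measured here.

```python
import csv
import io
from collections import defaultdict


def per_id_means(lines):
    """Return {ID: mean of column one}, with IDs taken from column two."""
    sums = defaultdict(float)   # running total per ID
    counts = defaultdict(int)   # number of rows seen per ID
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        key = row[1]
        sums[key] += float(row[0])
        counts[key] += 1
    # Divide once at the end instead of updating a mean on every line.
    return {key: sums[key] / counts[key] for key in sums}


means = per_id_means(io.StringIO("1,1\n1,1\n2,2\n3,2\n2,2\n1,2\n"))
print(means)  # {'1': 1.0, '2': 2.0}
```

Keeping the sums and counts separate until the end also means that per-file results could be added together before dividing, which may help if the files are processed in parallel.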

Using sqlite3 has been mentioned in the linked question but I would be interested in seeing if this can be done efficiently in Python first. Thanks in advance!

Source: https://stackoverflow.com/questions/53543951/calculate-running-total-from-csv-line-by-line
