Question
I'm loading a csv file line by line because it has ~800 million lines, and there are many of these files which I need to analyse. Loading in parallel is therefore paramount, and loading line by line is also required so as not to blow up the memory.
I have already been given an answer on how to count the number of entries in which each unique ID appears throughout the dataset, using collections.Counter() (see Counting csv column occurrences on the fly in Python).
But is there a way to calculate, for each unique ID in one column, a running total of the values in another column as each line is read in?
e.g. suppose the data in your csv file had only two columns and therefore looked like the following:
[1 1]
[1 1]
[2 2]
[3 2]
[2 2]
[1 2]
Where the second column contains the unique IDs for which you want to keep a running total of values in the first column. So your output should look like the following:
{'1': 2, '2': 8}
Where for ID '1' in column two, the total is given by 1+1 from column one, and for ID '2' in column two, the total is given by 2+3+2+1.
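To make the target concrete, here is a minimal sketch of that aggregation on the six example rows (assuming both columns are read in as strings and the first column is summed as floats):

from collections import defaultdict

# The six example rows as (value, ID) pairs, i.e. (column one, column two)
rows = [("1", "1"), ("1", "1"), ("2", "2"),
        ("3", "2"), ("2", "2"), ("1", "2")]

totals = defaultdict(float)   # running total of column one, keyed by the ID in column two
for value, ID in rows:
    totals[ID] += float(value)

print(dict(totals))           # {'1': 2.0, '2': 8.0}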
How can one do this quickly given the vast size of the csv I'm working with?
import csv
features = {}
with open(filename) as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        ID = row[1]
        if ID not in features:          # first time this ID is seen
            features[ID] = {}
            features[ID]['Some_feature'] = 0
        features[ID]['Some_feature'] += float(row[0])
But doing so with so many lines takes far, far too long. The idea here is that I would also create a similar dictionary, but holding the number of occurrences, so that I can divide the features dictionary by that dictionary to calculate a mean of the features. (This may seem unnecessary, but remember that these files are so large that they have to be read in line by line.)
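For reference, the sum-and-count idea described above could be written as a single pass keeping two flat dictionaries per ID. This is only a sketch of that approach, assuming the same layout as above (numeric value in column one, ID in column two) and that filename is defined as in the snippet:

import csv
from collections import defaultdict

sums = defaultdict(float)    # running total of column one per ID
counts = defaultdict(int)    # number of rows seen per ID

with open(filename) as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        ID = row[1]
        sums[ID] += float(row[0])
        counts[ID] += 1

# Mean feature value per ID, computed once after the pass
means = {ID: sums[ID] / counts[ID] for ID in sums}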
Using sqlite3 has been mentioned in the linked question, but I would be interested in seeing whether this can be done efficiently in Python first. Thanks in advance!
Source: https://stackoverflow.com/questions/53543951/calculate-running-total-from-csv-line-by-line