I am processing a csv-file which is 2.5 GB big. The 2.5 GB table looks like this:
columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
Based on your snippet, reading line-by-line ( I assume that kb_2 is the error indicator ):
groups = {}
with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0.0}
        groups[k]['record_count'] += 1
        groups[k]['error_sum'] += float(arr[2])

for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))
This code snippet stores all the groups in a dictionary and calculates the error rate only after the entire file has been read.
It will run into an out-of-memory exception if there are too many distinct group combinations.
Q: Does anyone know what is happening?
A: Yes. The sum of all data plus the memory-overheads of the in-RAM objects !< RAM
It is a natural part of any formal abstraction to add some overhead, so that additional features can be implemented on a higher ( more abstract ) layer. That means the more abstract / feature-rich a representation of a dataset is chosen, the more memory- & processing-overheads are to be expected.
import numpy as np

ITEMasINT   = 32345
ITEMasTUPLE = ( 32345, )
ITEMasLIST  = [ 32345, ]
ITEMasARRAY = np.array( [ 32345, ] )
ITEMasDICT  = { 0: 32345, }

######## .__sizeof__() -> int  # "size of object in memory, in bytes"
ITEMasINT.__sizeof__()   ->  12  #  100% _ trivial INT
ITEMasTUPLE.__sizeof__() ->  16  #  133% _ en-tuple-d
ITEMasLIST.__sizeof__()  ->  24  #  200% _ list-ed
ITEMasARRAY.__sizeof__() ->  40  #  333% _ numpy-wrapped
ITEMasDICT.__sizeof__()  -> 124  # 1033% _ hash-associated asDict
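To reproduce such measurements on your own machine, a minimal sketch could look like the one below ( the exact byte counts are platform-, Python-version- and numpy-version-dependent, so expect different absolute numbers; sys.getsizeof() is the documented wrapper around .__sizeof__() plus garbage-collector overhead ):

import sys
import numpy as np

candidates = [ ( 'trivial INT    ', 32345 ),
               ( 'en-tuple-d     ', ( 32345, ) ),
               ( 'list-ed        ', [ 32345, ] ),
               ( 'numpy-wrapped  ', np.array( [ 32345, ] ) ),
               ( 'asDict         ', { 0: 32345, } ),
               ]

base = sys.getsizeof( candidates[0][1] )                 # plain int as the 100% reference

for label, obj in candidates:
    size = sys.getsizeof( obj )                          # per-object size, not counting referenced items
    print( '{0:s} {1:>5d} B ~ {2:>5.0f}%'.format( label, size, 100.0 * size / base ) )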
If personal experience is not enough, check the "costs" of re-wrapping the ( already not small ) input data into pandas and the overheads that come with it:
CParserError: Error tokenizing data. C error: out of memory
Segmentation fault (core dumped)
and
CParserError: Error tokenizing data. C error: out of memory
*** glibc detected *** python: free(): ...
...
..
.
Aborted (core dumped)
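For context, crashes of this kind are what one typically gets from a whole-file load, i.e. something along these lines ( the file name and column names are taken from the question; the exact failure mode depends on the available RAM and the pandas version ):

import pandas as pd

# a whole-file load: pandas has to materialise all rows, plus per-object overheads, in RAM at once
df = pd.read_csv( "data/petaJoined.csv",
                  sep   = "\t",
                  names = [ "ka", "kb_1", "kb_2", "timeofEvent", "timeInterval" ] )

print( df.groupby( [ "ka", "kb_1" ] )[ "kb_2" ].mean() )   # works only if the whole table fits in RAM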
Q: Maybe there is a solution?
A: Simply follow the computational strategy and deploy a memory-efficient & fast processing of the csv-input. It is still fileIO, with some 8-15 ms access times and a rather low-performance stream data-flow ( even SSD-devices peak at about 960 MB/s transfer rate ), so your blocking factor is the memory-allocation limit, not the input stream. Rather be patient with the input-stream and do not crash into a principal memory-barrier with some in-RAM super-object, which would have been introduced just to be finally asked ( if it did not crash already during its instantiation ... ) to compute a plain sum / nROWs.
Line-by-line or block-arranged reads allow you to calculate the results on-the-fly, and a register-based sliding-window computation strategy ( a dict and alike, used as interim storage of the results ) is both fast and memory-efficient ( Uri has provided an example of exactly that above; a cleaned-up sketch follows below ).
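As an illustration only, a minimal sketch of such a single-pass, register-based aggregation ( the file name, the tab delimiter and the five-column layout without a header row are assumptions carried over from the snippet above; collections.defaultdict merely removes the explicit membership test ):

from collections import defaultdict

# one small "register" per group: [ record_count, error_sum ]
registers = defaultdict( lambda: [ 0, 0.0 ] )

with open( "data/petaJoined.csv", "r" ) as large_file:
    for line in large_file:
        # assumed layout: ka \t kb_1 \t kb_2 \t timeofEvent \t timeInterval
        ka, kb_1, kb_2, _timeofEvent, _timeInterval = line.rstrip( "\n" ).split( "\t" )
        reg = registers[ ka + "," + kb_1 ]          # composite group key, as above
        reg[0] += 1                                 # record_count
        reg[1] += float( kb_2 )                     # error_sum ( kb_2 as the error indicator )

for group, ( record_count, error_sum ) in registers.items():
    print( "{0}: {1}".format( group, error_sum / record_count ) )

The same idea applies to block-arranged reads: pandas.read_csv() accepts a chunksize parameter, so one can aggregate each chunk into such per-group registers and merge them, without ever holding the full 2.5 GB table in RAM at once.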
This principal approach has been used both in real-time constrained systems and in system-on-chip designs for processing large data-streams for more than the last half century, so nothing new under the Sun.
In case even your results' size cannot fit in RAM, then it makes no sense to even start processing the input file, does it?
Processing BigData is neither about super-up-scaling the COTS-dataObjects nor about finding the best or the most sexy "one-liner" ...
BigData requires a lot of understanding of how to process data both fast and smart, so as to avoid the extreme costs of even small overheads. Principal mistakes are forgiving on just a few GBs of small-bigData, but will kill anyone's budget & efforts once the same is tried on a larger playground.