I'm currently using Pandas for a project with csv source files of around 600mb. During the analysis I am reading in the csv to a dataframe, grouping on some column and applying
Using 0.14.1, I don't think there is a memory leak (it's about 1/3 the size of your frame).
In [79]: df = DataFrame(np.random.randn(100000,3))
In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop
In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
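(For reference, the %memit magic above comes from the memory_profiler package. A rough sketch of how to reproduce this kind of measurement in an IPython session, assuming memory_profiler is installed:)

# pip install memory_profiler, then in IPython:
%load_ext memory_profiler

import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.randn(100000, 3))

# -r 3 repeats the statement and reports the peak usage seen
%memit -r 3 df.groupby(df.index).apply(lambda x: x)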
Two general comments on how to approach a problem like this:
1) Use the cython-level functions if at all possible; they will be MUCH faster and use much less memory. IOW, it is almost always worth it to decouple a groupby expression and avoid passing a python function (sometimes things are just too complicated, but that's the point, you want to break things down). e.g.
Instead of:
df.groupby(...).apply(lambda x: x.sum() / x.mean())
It is MUCH better to do:
g = df.groupby(...)
g.sum() / g.mean()
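A rough, self-contained illustration of the difference (the 'key'/'val' columns here are made up); both paths give the same answer, but the second never calls a python function per group:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 100, 100000),
                   'val': np.random.randn(100000)})

# python-level function called once per group
slow = df.groupby('key')['val'].apply(lambda x: x.sum() / x.mean())

# each step stays in the cython aggregation code
g = df.groupby('key')['val']
fast = g.sum() / g.mean()

assert np.allclose(slow, fast)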
2) You can easily 'control' the groupby by doing your aggregation manually (additionally, this allows periodic progress output and garbage collection if needed):
import gc
import pandas as pd

results = []
for i, (name, grp) in enumerate(df.groupby(...)):
    if i % 500 == 0:
        # periodic progress output / explicit garbage collection
        print("checkpoint: %s" % i)
        gc.collect()
    results.append(func(name, grp))

# final result
pd.concat(results)
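A self-contained sketch of that pattern with a toy frame and a made-up func (the column names and the per-group function are purely hypothetical):

import gc
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 2000, 100000),
                   'val': np.random.randn(100000)})

def func(name, grp):
    # hypothetical per-group work; here just a 1-row summary
    return pd.DataFrame({'key': [name], 'total': [grp['val'].sum()]})

results = []
for i, (name, grp) in enumerate(df.groupby('key')):
    if i % 500 == 0:
        print("checkpoint: %s" % i)
        gc.collect()
    results.append(func(name, grp))

out = pd.concat(results, ignore_index=True)

Collecting the small per-group results in a list and concatenating once at the end is much cheaper than growing a DataFrame inside the loop.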