I'm currently using Pandas for a project with csv source files of around 600mb. During the analysis I am reading in the csv to a dataframe, grouping on some column and applying
Using 0.14.1, I don't think there is a memory leak (it's about 1/3 the size of your frame).
In [79]: df = DataFrame(np.random.randn(100000,3))
In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop
In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
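(For reference, the %memit magic above comes from the memory_profiler package. A rough sketch of how to reproduce this kind of measurement in an IPython session, assuming memory_profiler is installed:)

# pip install memory_profiler, then in IPython:
%load_ext memory_profiler

import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.randn(100000, 3))

# -r 3 repeats the statement and reports the peak usage seen
%memit -r 3 df.groupby(df.index).apply(lambda x: x)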
Two general comments on how to approach a problem like this:
1) Use the cython-level functions if at all possible; they will be MUCH faster and use much less memory. IOW, it is almost always worth it to decouple a groupby expression and avoid passing a python function (sometimes things are just too complicated, but that's the point, you want to break things down). e.g.
Instead of:
df.groupby(...).apply(lambda x: x.sum() / x.mean())
It is MUCH better to do:
g = df.groupby(...)
g.sum() / g.mean()
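A rough, self-contained illustration of the difference (the 'key'/'val' columns here are made up); both paths give the same answer, but the second never calls a python function per group:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 100, 100000),
                   'val': np.random.randn(100000)})

# python-level function called once per group
slow = df.groupby('key')['val'].apply(lambda x: x.sum() / x.mean())

# each step stays in the cython aggregation code
g = df.groupby('key')['val']
fast = g.sum() / g.mean()

assert np.allclose(slow, fast)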
2) You can easily 'control' the groupby by doing your aggregation manually (additionally, this allows periodic progress output and garbage collection if needed):
import gc
import pandas as pd

results = []
for i, (name, grp) in enumerate(df.groupby(...)):
    if i % 500 == 0:
        # periodic progress output / explicit garbage collection
        print("checkpoint: %s" % i)
        gc.collect()
    results.append(func(name, grp))

# final result
pd.concat(results)
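A self-contained sketch of that pattern with a toy frame and a made-up func (the column names and the per-group function are purely hypothetical):

import gc
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 2000, 100000),
                   'val': np.random.randn(100000)})

def func(name, grp):
    # hypothetical per-group work; here just a 1-row summary
    return pd.DataFrame({'key': [name], 'total': [grp['val'].sum()]})

results = []
for i, (name, grp) in enumerate(df.groupby('key')):
    if i % 500 == 0:
        print("checkpoint: %s" % i)
        gc.collect()
    results.append(func(name, grp))

out = pd.concat(results, ignore_index=True)

Collecting the small per-group results in a list and concatenating once at the end is much cheaper than growing a DataFrame inside the loop.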