Memory leak in Pandas.groupby.apply()?

眼角桃花 2021-02-09 05:13

I'm currently using Pandas for a project with CSV source files of around 600 MB. During the analysis I am reading the CSV into a dataframe, grouping on some column, and applying a function to each group.

1 Answer
  • 2021-02-09 05:47

    Using pandas 0.14.1, I don't think there is a memory leak (the test frame below is roughly 1/3 the size of yours).

    In [79]: df = DataFrame(np.random.randn(100000,3))
    
    In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
    maximum of 3: 1365.652344 MB per loop
    
    In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
    maximum of 10: 1365.683594 MB per loop
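
    (If you want to reproduce this check outside IPython, a rough sketch along the lines below should work; the %memit magic above comes from the memory_profiler package, which also exposes a memory_usage function.)

    import numpy as np
    import pandas as pd
    from memory_profiler import memory_usage

    df = pd.DataFrame(np.random.randn(100000, 3))

    def run():
        # same operation as the %memit calls above
        df.groupby(df.index).apply(lambda x: x)

    # sample memory (in MiB) while run() executes and take the peak
    peak = max(memory_usage((run, (), {})))
    print("peak memory: %.1f MiB" % peak)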
    

    Two general comments on how to approach a problem like this:

    1) Use a cython-level (built-in) aggregation function if at all possible; it will be MUCH faster and will use much less memory. IOW, it is almost always worth it to decouple a groupby expression and avoid passing a Python function to apply (some things are just too complicated for that, but that's the point: you want to break things down). e.g.

    Instead of:

    df.groupby(...).apply(lambda x: x.sum() / x.mean())
    

    It is MUCH better to do:

    g = df.groupby(...)
    g.sum() / g.mean()
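
    As a concrete (hypothetical) illustration, with made-up column names 'key' and 'val', both forms give the same per-group result, but the second stays in cython-level aggregations:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'key': np.random.randint(0, 100, 100000),
                       'val': np.random.randn(100000)})

    # slow path: a Python function called once per group
    slow = df.groupby('key')['val'].apply(lambda x: x.sum() / x.mean())

    # fast path: built-in (cython) aggregations, combined afterwards
    g = df.groupby('key')['val']
    fast = g.sum() / g.mean()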
    

    2) You can easily 'control' the groupby by doing your aggregation manually (additionally, this allows periodic progress output and garbage collection if needed), e.g.:

    import gc

    import pandas as pd

    results = []
    for i, (g, grp) in enumerate(df.groupby(...)):

        # periodic progress output and garbage collection
        if i % 500 == 0:
            print("checkpoint: %s" % i)
            gc.collect()

        # func is your per-group function; g is the group key, grp the sub-frame
        results.append(func(g, grp))

    # final result
    pd.concat(results)
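
    If memory is still tight, a variation on the same idea is to write each batch of results out as you go (e.g. append to a file or an HDFStore) instead of keeping everything in the results list, then combine at the end.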
    