Python Pandas MemoryError

余生分开走 2021-01-14 22:14

I have the following versions installed:

python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1


        
2 Answers
  • 2021-01-14 22:54

    I can also reproduce it on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13.
    So try upgrading your pandas version: on 0.14 the vectorized approach is much faster than apply (5 s vs. more than 1 min on my machine) and uses less peak memory (about 200 MiB vs. 980 MiB, measured with %memit).

    Using your sample data repeated 50000 times (leading to a df of 450k rows), and using the apply_id function of @jsalonen:

    In [23]: pd.__version__ 
    Out[23]: '0.14.0'
    
    In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
    1 loops, best of 3: 5.42 s per loop
    
    In [25]: %timeit df_train.apply(apply_id, 1)
    1 loops, best of 3: 1min 11s per loop
    
    In [26]: %load_ext memory_profiler
    
    In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
    peak memory: 201.75 MiB, increment: 0.01 MiB
    
    In [28]: %memit df_train.apply(apply_id, 1)
    peak memory: 982.56 MiB, increment: 780.79 MiB
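
To make the comparison above easy to try on a small scale, here is a minimal sketch that builds the same ID both ways and checks that the results agree. The column names Store, Dept and Date_Str come from the question; the three sample rows are made up for illustration, and the snippet assumes a pandas version without the 0.13 bug.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df_train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date_Str": ["2010-02-05", "2010-02-12", "2010-02-05"],
})

# Vectorized: convert each column to str once, then concatenate whole columns.
vectorized_id = (
    df_train["Store"].astype(str)
    + "_" + df_train["Dept"].astype(str)
    + "_" + df_train["Date_Str"].astype(str)
)

def apply_id(x):
    # Row-wise version from the other answer: called once per row.
    x["_id"] = "{}_{}_{}".format(x["Store"], x["Dept"], x["Date_Str"])
    return x

applied = df_train.apply(apply_id, axis=1)

print(vectorized_id.tolist())
print((applied["_id"] == vectorized_id).all())
```

Both approaches produce the same strings; the difference measured above is only in run time and peak memory.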
    
  • 2021-01-14 23:07

    Try generating the _id field with a DataFrame.apply call:

    def apply_id(x):
        x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
        return x
    
    df_train = df_train.apply(apply_id, 1)
    

    When using apply, the ID is generated one row at a time, which should keep the memory-allocation overhead minimal.
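
If only the new column is needed, one possible variation is to return just the ID string from the function, so that apply builds a single Series rather than a modified copy of every row. This is a sketch and not part of the answer above: make_id is a hypothetical helper name, the sample rows are made up, and whether it lowers peak memory on real data would need to be measured (for example with %memit).

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df_train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date_Str": ["2010-02-05", "2010-02-12", "2010-02-05"],
})

def make_id(row):
    # Hypothetical helper: returns only the ID string for one row.
    return "{}_{}_{}".format(row["Store"], row["Dept"], row["Date_Str"])

# axis=1 passes each row to make_id; the result is a Series of strings.
df_train["_id"] = df_train.apply(make_id, axis=1)

print(df_train["_id"].tolist())
```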
