Splitting dataframe into multiple dataframes

Backend · open · 11 answers · 1153 views

南方客 · 2020-11-22 01:16

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).

I would like to split the dataframe into 60 dataframes (a dataframe for each respondent).

11 Answers
  •  温柔的废话
    2020-11-22 01:38

    In [28]: df = DataFrame(np.random.randn(1000000,10))
    
    In [29]: df
    Out[29]: 
    
    Int64Index: 1000000 entries, 0 to 999999
    Data columns (total 10 columns):
    0    1000000  non-null values
    1    1000000  non-null values
    2    1000000  non-null values
    3    1000000  non-null values
    4    1000000  non-null values
    5    1000000  non-null values
    6    1000000  non-null values
    7    1000000  non-null values
    8    1000000  non-null values
    9    1000000  non-null values
    dtypes: float64(10)
    
    In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in range(int(len(df)/60.) + 1) ]
    
    In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in range(int(len(df)/60.) + 1) ]
    1 loops, best of 3: 849 ms per loop
    
    In [32]: len(frames)
    Out[32]: 16667
    
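A Python 3 sketch of the same slicing approach (note that `iloc` slicing already clips at the end of the frame, so the `min()` guard is unnecessary; a smaller frame is used here just to keep the example quick):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))

# Slicing past the end is safe with .iloc, so no min() is needed;
# stepping the range by the chunk size also avoids the off-by-one
# empty frame the original comprehension can produce.
frames = [df.iloc[i:i + 60] for i in range(0, len(df), 60)]

print(len(frames))       # 17 chunks: ceil(1000 / 60)
print(len(frames[-1]))   # 40 rows in the ragged final chunk
```

If the goal is one dataframe per respondent rather than fixed-size chunks, iterating over a `groupby` on the respondent column (whatever it is called in your data) gives a dict of sub-frames directly: `{rid: sub for rid, sub in df.groupby("respondent")}`.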

    Here's a groupby way (and you could apply an arbitrary function rather than sum):

    In [9]: g = df.groupby(lambda x: x // 60)
    
    In [8]: g.sum()    
    
    Out[8]: 
    
    Int64Index: 16667 entries, 0 to 16666
    Data columns (total 10 columns):
    0    16667  non-null values
    1    16667  non-null values
    2    16667  non-null values
    3    16667  non-null values
    4    16667  non-null values
    5    16667  non-null values
    6    16667  non-null values
    7    16667  non-null values
    8    16667  non-null values
    9    16667  non-null values
    dtypes: float64(10)
    

    Sum is cythonized, which is why it is so fast:

    In [10]: %timeit g.sum()
    10 loops, best of 3: 27.5 ms per loop
    
    In [11]: %timeit df.groupby(lambda x: x // 60)
    1 loops, best of 3: 231 ms per loop
    
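On modern pandas the same positional grouping is usually written with an integer key array instead of a lambda, which is both clearer and faster (the lambda is called once per index label). A minimal sketch, on a smaller frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))

# Group consecutive blocks of 60 rows by integer position:
# np.arange(len(df)) // 60 yields 0,0,...,0,1,1,... as group labels.
g = df.groupby(np.arange(len(df)) // 60)

sums = g.sum()
print(sums.shape)   # (17, 10): ceil(1000 / 60) groups, 10 columns
```

Any reduction or `g.apply(...)` works the same way on these positional groups.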
