Data processing: adding columns dynamically to a Python pandas DataFrame


I have the following problem. Let's say this is my CSV:

id f1 f2 f3
1  4  5  5
1  3  1  0
1  7  4  4
1  4  3  1
1  1  4  6
2  2  6  0
..........

1 answer
  • 2021-01-23 11:45

    Groupby is your friend.

    This scales very well: adding features costs only a small constant factor, and the overall cost is roughly O(number of groups).

    In [26]: import numpy as np

    In [27]: from pandas import DataFrame, Series, concat

    In [28]: features = ['f1','f2','f3']
    

    Create some test data: 70k groups, with group sizes from 7 to 11 (the upper bound of np.random.randint is exclusive).

    In [29]: def create_df(i):
       ....:     l = np.random.randint(7,12)
       ....:     df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
       ....:     df['A'] = i
       ....:     return df
       ....: 
    
    In [30]: df = concat([ create_df(i) for i in range(70000) ])
    
    In [39]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 629885 entries, 0 to 9
    Data columns (total 4 columns):
    f1    629885 non-null int64
    f2    629885 non-null int64
    f3    629885 non-null int64
    A     629885 non-null int64
    dtypes: int64(4)
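
    For readers on a recent pandas, the same kind of test data can be built without a Python-level loop over DataFrames. This is only a sketch under assumptions that are not in the answer: the generator `np.random.default_rng`, the smaller `n_groups`, and the variable names `sizes` and `within` are illustrative choices.

```python
import numpy as np
import pandas as pd

features = ["f1", "f2", "f3"]
n_groups = 1000  # assumption: smaller than the 70k used above, to keep it quick

rng = np.random.default_rng(0)
# Group sizes 7..11; as with np.random.randint, the upper bound is exclusive.
sizes = rng.integers(7, 12, size=n_groups)

# Group label for every row, and the row's position inside its group.
labels = np.repeat(np.arange(n_groups), sizes)
within = np.concatenate([np.arange(s) for s in sizes])

# Every feature column simply counts 0..size-1 within its group, as in create_df.
df = pd.DataFrame({f: within for f in features})
df["A"] = labels
```

    Unlike the concat version, this frame has a unique 0..n-1 row index.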
    

    Create a frame that selects the first 3 rows and the final row of each group. Note that this WILL handle groups of size < 4; however, in such a group the final row overlaps one of the first rows and is selected twice. You may wish to do a groupby.filter first to remedy this.

    In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
    
    # This step is necessary in pandas < master/0.14, because the returned columns
    # will include the grouping field (A); this is a bug/API issue.
    In [33]: groups = groups[features]
    
    In [34]: groups.head(20)
    Out[34]: 
         f1  f2  f3
    A              
    0 0   0   0   0
      1   1   1   1
      2   2   2   2
      7   7   7   7
    1 0   0   0   0
      1   1   1   1
      2   2   2   2
      9   9   9   9
    2 0   0   0   0
      1   1   1   1
      2   2   2   2
      8   8   8   8
    3 0   0   0   0
      1   1   1   1
      2   2   2   2
      8   8   8   8
    4 0   0   0   0
      1   1   1   1
      2   2   2   2
      9   9   9   9
    
    [20 rows x 3 columns]
    
    In [38]: groups.info()
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 280000 entries, (0, 0) to (69999, 9)
    Data columns (total 3 columns):
    f1    280000 non-null int64
    f2    280000 non-null int64
    f3    280000 non-null int64
    dtypes: int64(3)
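
    The overlap caveat for small groups, and the groupby.filter remedy, can be seen on a tiny frame. This is a minimal sketch with made-up data, assuming a recent pandas in which groupby head/tail keep the original row labels; the names `picked`, `big_enough` and `picked_ok` are illustrative.

```python
import pandas as pd

# Illustrative data: group 0 has 5 rows, group 1 has only 2.
df = pd.DataFrame({
    "A":  [0, 0, 0, 0, 0, 1, 1],
    "f1": [4, 3, 7, 4, 1, 2, 9],
})

# Without filtering, the last row of the small group is selected twice:
# once by head(3) and once by tail(1).
g = df.groupby("A")
picked = pd.concat([g.head(3), g.tail(1)]).sort_index()
n_duplicates = int(picked.index.duplicated().sum())

# Keep only groups with at least 4 rows, then select as before.
big_enough = df.groupby("A").filter(lambda grp: len(grp) >= 4)
g_ok = big_enough.groupby("A")
picked_ok = pd.concat([g_ok.head(3), g_ok.tail(1)]).sort_index()
```

    Whether to drop small groups or keep them with fewer rows depends on what the wide format downstream should contain.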
    

    And it is pretty fast:

    In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
    1 loops, best of 3: 1.16 s per loop
    

    For further manipulation you should usually stop here and use this, as it's in a nice grouped format that's easy to deal with.

    If you want to translate this to a wide format:

    In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
    
    In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
    1 loops, best of 3: 14.5 s per loop
    
    In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]
    
    In [41]: dfg.head()
    Out[41]: 
       f1_1  f2_1  f3_1  f1_2  f2_2  f3_2  f1_3  f2_3  f3_3  f1_4  f2_4  f3_4
    A                                                                        
    0     0     0     0     1     1     1     2     2     2     7     7     7
    1     0     0     0     1     1     1     2     2     2     9     9     9
    2     0     0     0     1     1     1     2     2     2     8     8     8
    3     0     0     0     1     1     1     2     2     2     8     8     8
    4     0     0     0     1     1     1     2     2     2     9     9     9
    
    [5 rows x 12 columns]
    
    In [42]: dfg.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 70000 entries, 0 to 69999
    Data columns (total 12 columns):
    f1_1    70000 non-null int64
    f2_1    70000 non-null int64
    f3_1    70000 non-null int64
    f1_2    70000 non-null int64
    f2_2    70000 non-null int64
    f3_2    70000 non-null int64
    f1_3    70000 non-null int64
    f2_3    70000 non-null int64
    f3_3    70000 non-null int64
    f1_4    70000 non-null int64
    f2_4    70000 non-null int64
    f3_4    70000 non-null int64
    dtypes: int64(12)
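
    The apply step is the slow part of the timings above. One alternative, which is not the method used in this answer, is to number the selected rows with cumcount and reshape with unstack. The sketch below assumes a recent pandas and a unique row index; group 1 reuses the first rows of the question's CSV, group 2 is made up because the question's data is truncated, and the `pos` column is a hypothetical helper. It has not been timed here.

```python
import pandas as pd

features = ["f1", "f2", "f3"]

# Group 1: rows from the question. Group 2: illustrative values only.
df = pd.DataFrame({
    "A":  [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "f1": [4, 3, 7, 4, 1, 2, 5, 6, 8],
    "f2": [5, 1, 4, 3, 4, 6, 0, 2, 9],
    "f3": [5, 0, 4, 1, 6, 0, 3, 7, 1],
})

# First 3 rows and the final row of each group, in original row order.
g = df.groupby("A")
groups = pd.concat([g.head(3), g.tail(1)]).sort_index()

# Number the selected rows 1..4 within each group.
groups["pos"] = groups.groupby("A").cumcount() + 1

# Move the position into the columns: one row per group.
wide = groups.set_index(["A", "pos"])[features].unstack("pos")

# Flatten the (feature, pos) labels, then order the columns row by row
# so they match the f1_1, f2_1, f3_1, f1_2, ... layout shown above.
wide.columns = ["{0}_{1}".format(f, i) for f, i in wide.columns]
wide = wide[["{0}_{1}".format(f, i) for i in range(1, 5) for f in features]]
```

    Groups with fewer than 4 rows would produce missing values or duplicate positions here, so filtering them out first is advisable.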
    