I have the following problem. Let's say this is my CSV:
id f1 f2 f3
1 4 5 5
1 3 1 0
1 7 4 4
1 4 3 1
1 1 4 6
2 2 6 0
..........
Groupby is your friend.
This will scale very well; the cost is only a small constant in the number of features, and roughly O(number of groups) overall.
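If you're starting from a whitespace-separated file like the one in the question, it can be loaded with `pandas.read_csv`; a minimal sketch (the inline sample here just mirrors the question's data):

```python
import io
import pandas as pd

# Sample text matching the question's layout (whitespace-separated)
csv_text = """id f1 f2 f3
1 4 5 5
1 3 1 0
1 7 4 4
1 4 3 1
1 1 4 6
2 2 6 0
"""

# sep=r"\s+" treats any run of whitespace as a single delimiter
df = pd.read_csv(io.StringIO(csv_text), sep=r"\s+")
print(df.shape)  # (6, 4)
```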
In [27]: import numpy as np; from pandas import DataFrame, Series, concat
In [28]: features = ['f1','f2','f3']
Create some test data: group sizes are 7-11, with 70k groups
In [29]: def create_df(i):
....: l = np.random.randint(7,12)
....: df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
....: df['A'] = i
....: return df
....:
In [30]: df = concat([ create_df(i) for i in xrange(70000) ])
In [39]: df.info()
Int64Index: 629885 entries, 0 to 9
Data columns (total 4 columns):
f1 629885 non-null int64
f2 629885 non-null int64
f3 629885 non-null int64
A 629885 non-null int64
dtypes: int64(4)
Create a frame where you select the first 3 rows and the final row from each group. Note that this WILL handle groups of size < 4; however, the final row may then overlap one of the first rows, so you may wish to use a groupby.filter to remedy this.
In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
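The groupby.filter mentioned above could look like this; a sketch with toy data (the frame below is hypothetical, not the 70k-group frame), dropping groups with fewer than 4 rows so head(3) and tail(1) never overlap:

```python
import pandas as pd

# Toy frame: group 1 has 5 rows, group 2 only 2 rows (hypothetical data)
df = pd.DataFrame({
    'A':  [1, 1, 1, 1, 1, 2, 2],
    'f1': [4, 3, 7, 4, 1, 2, 5],
})

# Keep only groups with at least 4 rows
filtered = df.groupby('A').filter(lambda g: len(g) >= 4)
print(filtered['A'].unique())  # [1]
```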
# This step is necessary in pandas < master/0.14, as the returned fields
# will include the grouping field (the A); this is a bug/API issue
In [33]: groups = groups[features]
In [34]: groups.head(20)
Out[34]:
f1 f2 f3
A
0 0 0 0 0
1 1 1 1
2 2 2 2
7 7 7 7
1 0 0 0 0
1 1 1 1
2 2 2 2
9 9 9 9
2 0 0 0 0
1 1 1 1
2 2 2 2
8 8 8 8
3 0 0 0 0
1 1 1 1
2 2 2 2
8 8 8 8
4 0 0 0 0
1 1 1 1
2 2 2 2
9 9 9 9
[20 rows x 3 columns]
In [38]: groups.info()
MultiIndex: 280000 entries, (0, 0) to (69999, 9)
Data columns (total 3 columns):
f1 280000 non-null int64
f2 280000 non-null int64
f3 280000 non-null int64
dtypes: int64(3)
And pretty fast
In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
1 loops, best of 3: 1.16 s per loop
For further manipulation you should usually stop here and use this, as it's in a nice grouped format that's easy to deal with.
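For example, per-group statistics come straight off this grouped format; a sketch with a small hypothetical frame shaped like `groups` (MultiIndex with the group label A on the outer level):

```python
import numpy as np
import pandas as pd

features = ['f1', 'f2', 'f3']

# Toy 'groups'-style frame: 2 groups x 4 selected rows each (hypothetical data)
idx = pd.MultiIndex.from_product([range(2), range(4)], names=['A', None])
groups = pd.DataFrame(np.arange(24).reshape(8, 3), index=idx, columns=features)

# Aggregate over the group level directly
means = groups.groupby(level='A').mean()
print(means.shape)  # (2, 3)
```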
If you want to translate this to a wide format
In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
1 loops, best of 3: 14.5 s per loop
In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]
In [41]: dfg.head()
Out[41]:
f1_1 f2_1 f3_1 f1_2 f2_2 f3_2 f1_3 f2_3 f3_3 f1_4 f2_4 f3_4
A
0 0 0 0 1 1 1 2 2 2 7 7 7
1 0 0 0 1 1 1 2 2 2 9 9 9
2 0 0 0 1 1 1 2 2 2 8 8 8
3 0 0 0 1 1 1 2 2 2 8 8 8
4 0 0 0 1 1 1 2 2 2 9 9 9
[5 rows x 12 columns]
In [42]: dfg.info()
Int64Index: 70000 entries, 0 to 69999
Data columns (total 12 columns):
f1_1 70000 non-null int64
f2_1 70000 non-null int64
f3_1 70000 non-null int64
f1_2 70000 non-null int64
f2_2 70000 non-null int64
f3_2 70000 non-null int64
f1_3 70000 non-null int64
f2_3 70000 non-null int64
f3_3 70000 non-null int64
f1_4 70000 non-null int64
f2_4 70000 non-null int64
f3_4 70000 non-null int64
dtypes: int64(12)
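As a side note, when every group contributes exactly the same number of rows (which the head(3) + tail(1) selection guarantees once short groups are filtered out), the slow per-group apply can be replaced by a single NumPy reshape; a sketch with toy data, not the 70k-group frame above:

```python
import numpy as np
import pandas as pd

features = ['f1', 'f2', 'f3']

# Toy 'groups' frame: 3 groups x 4 selected rows each (hypothetical data)
idx = pd.MultiIndex.from_product([range(3), range(4)], names=['A', None])
groups = pd.DataFrame(
    np.arange(36).reshape(12, 3), index=idx, columns=features
)

# Each group contributes exactly 4 rows, so the underlying values can be
# reshaped in one shot instead of building a Series per group
wide = pd.DataFrame(
    groups.values.reshape(-1, 4 * len(features)),
    index=groups.index.get_level_values('A').unique(),
    columns=["{0}_{1}".format(f, i) for i in range(1, 5) for f in features],
)
print(wide.shape)  # (3, 12)
```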