Question
I have 2 columns and I want a 3rd column to be the minimum value between them. My data looks like this:
   A  B
0  2  1
1  2  1
2  2  4
3  2  4
4  3  5
5  3  5
6  3  6
7  3  6
And I want to get a column C in the following way:
   A  B  C
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3
Some helping code:
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
Thanks!
Answer 1:
Use df.min(axis=1):
In[41]:
df['c'] = df.min(axis=1)
df
Out[41]:
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3
This returns the row-wise minimum (when passing axis=1).
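If the DataFrame holds columns other than A and B, they would otherwise be included in the reduction, so it can help to select the two columns first. A minimal sketch, assuming a hypothetical extra column named label that is not part of the original question:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6],
                   'label': list('abcdefgh')})  # hypothetical extra column

# Restrict the row-wise reduction to the two columns of interest
df['C'] = df[['A', 'B']].min(axis=1)
print(df)
```

Selecting the columns explicitly also keeps the result stable if more columns are added to the frame later.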
For homogeneous dtypes (every column shares the same dtype) and large DataFrames, you can use numpy.min, which is quicker:
In[42]:
df['c'] = np.min(df.values,axis=1)
df
Out[42]:
   A  B  c
0  2  1  1
1  2  1  1
2  2  4  2
3  2  4  2
4  3  5  3
5  3  5  3
6  3  6  3
7  3  6  3
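Since the numpy route is only recommended for homogeneous dtypes, one way to guard it is to check the column dtypes first and fall back to the pandas method otherwise. This is a sketch under that assumption; row_min is a hypothetical helper name, not part of pandas or numpy, and to_numpy() requires pandas 0.24.0 or greater:

```python
import numpy as np
import pandas as pd

def row_min(frame, columns):
    """Row-wise minimum over `columns` (hypothetical helper)."""
    sub = frame[columns]
    if sub.dtypes.nunique() == 1:
        # All selected columns share one dtype: reduce on the raw array
        return np.min(sub.to_numpy(), axis=1)
    # Mixed dtypes: let pandas handle the reduction
    return sub.min(axis=1).to_numpy()

df = pd.DataFrame({'A': [2, 2, 3, 3], 'B': [1, 4, 5, 6]})
df['c'] = row_min(df, ['A', 'B'])

mixed = pd.DataFrame({'A': [2, 3], 'B': [1.5, 4.0]})  # int64 and float64
mixed['c'] = row_min(mixed, ['A', 'B'])
```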
Timings:
In[45]:
df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df]*1000, ignore_index=True)
df.shape
Out[45]: (8000, 2)
So for an 8K-row df:
%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
314 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
34.4 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can see that the numpy version is nearly 10x quicker (note that I pass df.values so that numpy receives an array). This becomes more of a factor with even larger DataFrames.
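%timeit is an IPython magic, so it is not available in a plain Python script. A rough equivalent using the standard-library timeit module is sketched below; the repetition count of 200 is an arbitrary choice, and the absolute numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 2, 2, 3, 3, 3, 3],
                   'B': [1, 1, 4, 4, 5, 5, 6, 6]})
df = pd.concat([df] * 1000, ignore_index=True)  # 8000 rows

n = 200  # arbitrary number of repetitions
t_pandas = timeit.timeit(lambda: df.min(axis=1), number=n)
t_numpy = timeit.timeit(lambda: np.min(df.values, axis=1), number=n)

# Report the average time per call in microseconds
print(f"pandas: {t_pandas / n * 1e6:.1f} µs per call")
print(f"numpy:  {t_numpy / n * 1e6:.1f} µs per call")
```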
Note: for pandas versions 0.24.0 or greater, use to_numpy(), so the above becomes:
df['c'] = np.min(df.to_numpy(),axis=1)
Timings:
%timeit df.min(axis=1)
%timeit np.min(df.values,axis=1)
%timeit np.min(df.to_numpy(),axis=1)
314 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
35.2 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.5 µs ± 262 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
There is a minor discrepancy between .values and to_numpy(). Which one to use depends on whether you know upfront that the dtype is not mixed, and on whether the resulting dtype matters, e.g. float16 vs float32 (the original answer links to further explanation). Pandas does a little more checking when calling to_numpy().
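To illustrate the dtype point, to_numpy() accepts a dtype argument, so the array's dtype can be requested explicitly instead of being inferred from the columns. A minimal sketch, with a float column B chosen here only to make the frame mixed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 3, 3],
                   'B': [1.5, 4.0, 5.0, 6.0]})  # int64 and float64 columns

arr = df.to_numpy()                   # common dtype is inferred: float64
arr32 = df.to_numpy(dtype='float32')  # dtype requested explicitly

print(arr.dtype, arr32.dtype)

df['c'] = np.min(arr32, axis=1)
```

Requesting a narrower dtype such as float32 can lose precision for values it cannot represent exactly, so this only makes sense when that trade-off is acceptable.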
Source: https://stackoverflow.com/questions/55654105/pandas-get-the-min-value-between-2-dataframe-columns