Given a pandas DataFrame as below:
import pandas as pd
from sklearn.metrics import mean_squared_error
df = pd.DataFrame.from_dict(
{'row': ['a
The df.apply approach:
df['rmse'] = df.apply(lambda x: mean_squared_error(x[['a','b','c']], x[['d','e','y']])**0.5, axis=1)
col     a     b     c     d     e     y      rmse
row
a    0.00 -0.80 -0.60 -0.30  0.80  0.01  1.003677
b   -0.80  0.00  0.50  0.70 -0.90  0.01  1.048825
c   -0.60  0.50  0.00  0.30  0.10  0.01  0.568653
d   -0.30  0.70  0.30  0.00  0.20  0.01  0.375988
e    0.80 -0.90  0.10  0.20  0.00  0.01  0.626658
y    0.01  0.01  0.01  0.01  0.01  0.00  0.005774
Approach #1
One approach for better performance is to work on the underlying array data with NumPy ufuncs, slicing the two blocks of columns so that the ufuncs operate on them in a vectorized manner:
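import numpy as np

a = df.values  # 2D array; columns 0-2 hold a,b,c and 3-5 hold d,e,y in the frame shown above
rmse_out = np.sqrt(((a[:,0:3] - a[:,3:6])**2).mean(1))  # row-wise mean of squared differences
df['rmse_out'] = rmse_out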
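As a quick sanity check, the sketch below rebuilds the frame from the printed values above (the original construction code is not reproduced here) and compares the vectorized result with a row-wise computation. The names `rmse_vectorized` and `rmse_rowwise` are only illustrative, and the row-wise reference uses plain NumPy as a stand-in for the sklearn-based `apply`.

```python
import numpy as np
import pandas as pd

# Values copied from the printed frame above (index 'row', columns a..y).
cols = ['a', 'b', 'c', 'd', 'e', 'y']
data = [
    [0.00, -0.80, -0.60, -0.30, 0.80, 0.01],
    [-0.80, 0.00, 0.50, 0.70, -0.90, 0.01],
    [-0.60, 0.50, 0.00, 0.30, 0.10, 0.01],
    [-0.30, 0.70, 0.30, 0.00, 0.20, 0.01],
    [0.80, -0.90, 0.10, 0.20, 0.00, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.00],
]
df = pd.DataFrame(data, index=pd.Index(cols, name='row'), columns=cols)

# Vectorized: one subtraction, one square and one mean over whole column blocks.
a = df.values
rmse_vectorized = np.sqrt(((a[:, 0:3] - a[:, 3:6]) ** 2).mean(axis=1))

# Row-wise reference: the same formula applied to one row at a time.
rmse_rowwise = df.apply(
    lambda x: np.sqrt(np.mean((x[['a', 'b', 'c']].values
                               - x[['d', 'e', 'y']].values) ** 2)),
    axis=1,
)

print(np.allclose(rmse_vectorized, rmse_rowwise.values))
```

Both paths compute the same numbers; the vectorized one simply avoids a Python-level call per row.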
Approach #2
An alternative, faster way to compute the RMSE values is to use np.einsum in place of the squared summation:
diffs = a[:,0:3] - a[:,3:6]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
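If the subscript string looks cryptic: 'ij,ij->i' multiplies the two inputs elementwise and sums over j, the column axis, leaving one value per row. A minimal sketch on made-up random data (the array here is an assumption, not the frame from the question):

```python
import numpy as np

# Made-up data purely for illustration: 5 rows, 6 columns.
rng = np.random.default_rng(0)
a = rng.standard_normal((5, 6))

diffs = a[:, 0:3] - a[:, 3:6]

# 'ij,ij->i': elementwise product of the inputs, summed over j (columns),
# so each row collapses to its sum of squared differences.
row_sq_sum = np.einsum('ij,ij->i', diffs, diffs)

# Same quantity spelled out with ufuncs.
row_sq_sum_ref = (diffs ** 2).sum(axis=1)

rmse_out = np.sqrt(row_sq_sum / 3.0)
print(np.allclose(row_sq_sum, row_sq_sum_ref))
```

The division by 3.0 is the number of columns in each block, so it would change if the blocks were wider.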
Approach #3
Another way to compute rmse_out uses the formula:
(a - b)^2 = a^2 + b^2 - 2ab
First, extract the slices:
s0 = a[:,0:3]
s1 = a[:,3:6]
Then rmse_out would be:
np.sqrt(((s0**2).sum(1) + (s1**2).sum(1) - (2*s0*s1).sum(1))/3.0)
which with einsum becomes:
np.sqrt((np.einsum('ij,ij->i',s0,s0) +
         np.einsum('ij,ij->i',s1,s1) -
         2*np.einsum('ij,ij->i',s0,s1))/3.0)
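A small check of the expanded form against the direct one, on made-up random data. One caveat worth hedging: when s0 and s1 are nearly equal, the subtraction in the expanded form can come out as a tiny negative number through floating-point cancellation, so this sketch clamps at zero before taking the square root. That clamp is my addition, not part of the formula above.

```python
import numpy as np

# Made-up data purely for illustration.
rng = np.random.default_rng(0)
s0 = rng.standard_normal((1000, 3))
s1 = rng.standard_normal((1000, 3))

# Direct form: square the differences, average per row, take the root.
direct = np.sqrt(((s0 - s1) ** 2).mean(axis=1))

# Expanded form: sum(s0^2) + sum(s1^2) - 2*sum(s0*s1), each summed per row.
sq_sum = (np.einsum('ij,ij->i', s0, s0)
          + np.einsum('ij,ij->i', s1, s1)
          - 2 * np.einsum('ij,ij->i', s0, s1))

# Clamp tiny negative values caused by cancellation before the square root.
expanded = np.sqrt(np.maximum(sq_sum, 0.0) / 3.0)

print(np.allclose(direct, expanded))
```

Whether this form is actually faster than Approach #2 likely depends on array shapes, so it is worth timing on your own data.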
Getting respective column indices
If you are not sure whether the columns a, b, ... appear in that order, you can look up their positions with column_index. a[:,0:3] would then be replaced by a[:,column_index(df, ['a','b','c'])] and a[:,3:6] by a[:,column_index(df, ['d','e','y'])].
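The definition of column_index is not shown here, so the following is only a hypothetical stand-in with the same call signature, built on pandas' Index.get_indexer. It assumes the column labels are unique and raises if a requested label is missing; the demo frame and its column order are made up for illustration.

```python
import numpy as np
import pandas as pd

def column_index(df, query_cols):
    """Hypothetical helper: integer positions of query_cols in df.columns.

    Assumes unique column labels; get_indexer marks missing labels with -1.
    """
    idx = df.columns.get_indexer(query_cols)
    if (idx < 0).any():
        missing = [c for c, i in zip(query_cols, idx) if i < 0]
        raise KeyError(f"columns not found: {missing}")
    return idx

# Made-up frame whose columns are deliberately out of order.
demo = pd.DataFrame(np.arange(12.0).reshape(2, 6),
                    columns=['y', 'c', 'a', 'e', 'b', 'd'])
arr = demo.values

idx_abc = column_index(demo, ['a', 'b', 'c'])
idx_dey = column_index(demo, ['d', 'e', 'y'])

# Index the array by position instead of relying on a fixed column order.
rmse_out = np.sqrt(((arr[:, idx_abc] - arr[:, idx_dey]) ** 2).mean(axis=1))
print(idx_abc, idx_dey)
```

Indexing with an integer array like this makes a copy of the selected columns rather than a view, unlike the plain slices a[:,0:3] and a[:,3:6].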