Specific pandas columns as arguments in new column of df.apply outputs

渐次进展 2021-01-24 18:37

Given a pandas DataFrame as below:

    import pandas as pd
    from sklearn.metrics import mean_squared_error

    df = pd.DataFrame.from_dict(
         {'row': ['a


        
2 Answers
  • 2021-01-24 19:00

    The df.apply approach:

    df['rmse'] = df.apply(lambda x: mean_squared_error(x[['a','b','c']], x[['d','e','y']])**0.5, axis=1)
    
    col     a     b     c     d     e     y      rmse
    row                                              
    a    0.00 -0.80 -0.60 -0.30  0.80  0.01  1.003677
    b   -0.80  0.00  0.50  0.70 -0.90  0.01  1.048825
    c   -0.60  0.50  0.00  0.30  0.10  0.01  0.568653
    d   -0.30  0.70  0.30  0.00  0.20  0.01  0.375988
    e    0.80 -0.90  0.10  0.20  0.00  0.01  0.626658
    y    0.01  0.01  0.01  0.01  0.01  0.00  0.005774
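
Because the question's own constructor is cut off, here is a minimal sketch that rebuilds the frame from the values printed above and reproduces the `rmse` column. The helper name `row_rmse` is made up for illustration, and the RMSE is computed with NumPy instead of `mean_squared_error` so the snippet needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c", "d", "e", "y"]

# Values copied from the printed output above, not from the truncated question.
values = [
    [0.00, -0.80, -0.60, -0.30, 0.80, 0.01],
    [-0.80, 0.00, 0.50, 0.70, -0.90, 0.01],
    [-0.60, 0.50, 0.00, 0.30, 0.10, 0.01],
    [-0.30, 0.70, 0.30, 0.00, 0.20, 0.01],
    [0.80, -0.90, 0.10, 0.20, 0.00, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.00],
]
df = pd.DataFrame(
    values,
    index=pd.Index(labels, name="row"),
    columns=pd.Index(labels, name="col"),
)


def row_rmse(row):
    """Hypothetical helper: RMSE between columns a, b, c and columns d, e, y of one row."""
    left = row[["a", "b", "c"]].to_numpy(dtype=float)
    right = row[["d", "e", "y"]].to_numpy(dtype=float)
    return float(np.sqrt(np.mean((left - right) ** 2)))


# axis=1 hands each row to row_rmse as a Series indexed by column name.
df["rmse"] = df.apply(row_rmse, axis=1)
print(df.round(6))
```

With `axis=1`, each call receives one row, so selecting columns by name inside the function is what lets specific columns act as its arguments.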
    
  • 2021-01-24 19:04

    Approach #1

    One approach for performance would be to work on the underlying array data with NumPy ufuncs, slicing the two blocks of columns so that the ufuncs operate on them in a vectorized manner, like so -

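    import numpy as np

    a = df.values  # underlying NumPy array; assumes columns a, b, c, d, e, y occupy positions 0-5
    rmse_out = np.sqrt(((a[:,0:3] - a[:,3:6])**2).mean(1))  # row-wise RMSE of the two blocks
    df['rmse_out'] = rmse_out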
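
As a sanity check, the sliced-array computation can be compared against a row-wise `apply` on a small random frame. This is only a sketch: the random data, the seed, and the name `via_apply` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, used only for repeatability
df = pd.DataFrame(rng.normal(size=(5, 6)), columns=list("abcdey"))


def rmse_of_row(r):
    # Row-wise baseline: RMSE between columns a, b, c and columns d, e, y.
    diff = r[["a", "b", "c"]].to_numpy() - r[["d", "e", "y"]].to_numpy()
    return np.sqrt(np.mean(diff ** 2))


via_apply = df.apply(rmse_of_row, axis=1)

# Vectorized version: slice the two column blocks out of the array once.
a = df.to_numpy()
rmse_out = np.sqrt(((a[:, 0:3] - a[:, 3:6]) ** 2).mean(axis=1))

print(np.allclose(via_apply.to_numpy(), rmse_out))
```

The two results agree; the vectorized form avoids calling a Python function once per row, which is where `apply` with `axis=1` spends most of its time.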
    

    Approach #2

    An alternative, faster way to compute the RMSE values is to let np.einsum replace the squaring and summation -

    diffs = a[:,0:3] - a[:,3:6]
    rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
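
To make the subscript string concrete, the sketch below uses a tiny hand-made `diffs` array (values chosen only for illustration). `'ij,ij->i'` multiplies the two inputs elementwise and sums over `j`, so with the same array passed twice it yields each row's sum of squares.

```python
import numpy as np

diffs = np.array([[1.0, 2.0, 3.0],
                  [0.5, -0.5, 0.0]])

# Elementwise product summed over the column axis j, one value per row i.
row_sq_sum = np.einsum("ij,ij->i", diffs, diffs)

# Same quantity with explicit squaring and summation, for comparison.
reference = (diffs ** 2).sum(axis=1)

# Divide by the number of columns per block, then take the root.
rmse = np.sqrt(row_sq_sum / diffs.shape[1])

print(row_sq_sum, reference, rmse)
```

Here the first row gives 1 + 4 + 9 = 14 and the second 0.25 + 0.25 + 0 = 0.5, matching the explicit version without building the intermediate squared array.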
    

    Approach #3

    Another way to compute rmse_out uses the identity:

    (a - b)^2 = a^2 + b^2 - 2ab

    To apply it, first extract the two slices:

    s0 = a[:,0:3]
    s1 = a[:,3:6]
    

    Then, rmse_out would be -

    np.sqrt(((s0**2).sum(1) + (s1**2).sum(1) - (2*s0*s1).sum(1))/3.0)
    

    which with einsum becomes -

    np.sqrt((np.einsum('ij,ij->i',s0,s0) + \
             np.einsum('ij,ij->i',s1,s1) - \
           2*np.einsum('ij,ij->i',s0,s1))/3.0)
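
A quick numerical check that the expanded form matches the direct difference-based one, sketched on random data. The seed, the helper name `row_dot`, and the clipping step are assumptions added here: the expanded sum subtracts nearly equal numbers when `s0` and `s1` are close, so round-off can leave a tiny negative value that would turn into `nan` under `np.sqrt`.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, used only for repeatability
s0 = rng.normal(size=(4, 3))
s1 = rng.normal(size=(4, 3))


def row_dot(x, y):
    """Hypothetical helper: row-wise dot product of two equally shaped 2D arrays."""
    return np.einsum("ij,ij->i", x, y)


# Direct form: square the differences, average per row, take the root.
direct = np.sqrt(((s0 - s1) ** 2).mean(axis=1))

# Expanded form: a^2 + b^2 - 2ab, summed per row.
expanded_sum = row_dot(s0, s0) + row_dot(s1, s1) - 2 * row_dot(s0, s1)

# Clip at zero to guard against small negative round-off before the root.
expanded = np.sqrt(np.maximum(expanded_sum, 0.0) / 3.0)

print(np.allclose(direct, expanded))
```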
    

    Getting respective column indices

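    If you are not sure whether the columns a, b, ... appear in that order, we could look up their positions with a helper such as column_index. It is not defined in this answer; as used here, it takes the DataFrame and a list of column names and returns their integer positions.

    Thus a[:,0:3] would be replaced by a[:,column_index(df, ['a','b','c'])] and a[:,3:6] by a[:,column_index(df, ['d','e','y'])].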
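
Since `column_index` is not a pandas function, here is one possible implementation, offered as a sketch and not necessarily the one the answer had in mind. It relies on `Index.get_indexer`, which returns -1 for labels that are missing; the shuffled example frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def column_index(df, query_cols):
    """Hypothetical helper: integer positions of query_cols within df.columns."""
    positions = df.columns.get_indexer(query_cols)
    if (positions < 0).any():
        missing = [c for c, p in zip(query_cols, positions) if p < 0]
        raise KeyError(f"columns not found: {missing}")
    return positions


# Columns deliberately stored in reverse order to show that position lookups still work.
df = pd.DataFrame(np.arange(12.0).reshape(2, 6), columns=list("yedcba"))

a = df.to_numpy()
s0 = a[:, column_index(df, ["a", "b", "c"])]
s1 = a[:, column_index(df, ["d", "e", "y"])]
rmse_out = np.sqrt(((s0 - s1) ** 2).mean(axis=1))
print(rmse_out)
```

Indexing the array with a list of positions returns the columns in the requested order, so the result no longer depends on how the frame happens to be laid out.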
