Specific pandas columns as arguments in new column of df.apply outputs

渐次进展 2021-01-24 18:37

Given a pandas DataFrame as below:

import pandas as pd
from sklearn.metrics import mean_squared_error

    df = pd.DataFrame.from_dict(
         {'row': ['a

2 Answers
  •  鱼传尺愫
    2021-01-24 19:04

    Approach #1

    For performance, one approach is to work on the underlying array data with NumPy ufuncs, slicing the two blocks of columns so that the ufuncs operate on them in a vectorized manner, like so -

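    import numpy as np

    a = df.values  # underlying 2-D array; assumes columns a,b,c,d,e,y in that order
    rmse_out = np.sqrt(((a[:,0:3] - a[:,3:6])**2).mean(1))  # row-wise RMSE
    df['rmse_out'] = rmse_out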
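
As a rough sanity check of Approach #1, the sketch below compares the vectorized result with a row-wise `df.apply`. The column names follow the question, but the values are made up for illustration and `row_rmse` is a hypothetical helper, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three columns (a, b, c) compared against three others (d, e, y).
df = pd.DataFrame({
    'a': [1.0, 2.0], 'b': [2.0, 4.0], 'c': [3.0, 6.0],
    'd': [1.0, 2.0], 'e': [2.0, 1.0], 'y': [5.0, 2.0],
})

# Vectorized: one array operation over all rows at once.
a = df.to_numpy()
vectorized = np.sqrt(((a[:, 0:3] - a[:, 3:6]) ** 2).mean(axis=1))

# Row-wise reference: the same formula applied one row at a time.
def row_rmse(row):
    left = row[['a', 'b', 'c']].to_numpy(dtype=float)
    right = row[['d', 'e', 'y']].to_numpy(dtype=float)
    return np.sqrt(np.mean((left - right) ** 2))

rowwise = df.apply(row_rmse, axis=1).to_numpy()
```

Both arrays should agree up to floating-point rounding; the vectorized form avoids the per-row Python call that `df.apply` makes.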
    

    Approach #2

    Alternatively, np.einsum can replace the square-and-sum step, which may be faster -

    diffs = a[:,0:3] - a[:,3:6]
    rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
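
To make the `'ij,ij->i'` subscript string concrete, here is a minimal sketch with made-up values. It shows that the einsum call yields the same per-row sum of squares as squaring and summing explicitly.

```python
import numpy as np

# Hypothetical differences for two rows and three column pairs.
diffs = np.array([[0.0, 0.0, -2.0],
                  [0.0, 3.0, 4.0]])

# 'ij,ij->i' multiplies the two inputs elementwise and sums over j,
# leaving one value per row i.
sq_sum_einsum = np.einsum('ij,ij->i', diffs, diffs)
sq_sum_plain = (diffs ** 2).sum(axis=1)

# Divide by the number of column pairs (3 here) to get the mean, then take the root.
rmse = np.sqrt(sq_sum_einsum / diffs.shape[1])
```

The einsum form skips the intermediate squared array, which is where any speedup would come from; whether it is faster on a given dataset is worth timing.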
    

    Approach #3

    Another way to compute rmse_out uses the expansion:

    (a - b)^2 = a^2 + b^2 - 2ab

    First extract the two slices:

    s0 = a[:,0:3]
    s1 = a[:,3:6]
    

    Then, rmse_out would be -

    np.sqrt(((s0**2).sum(1) + (s1**2).sum(1) - (2*s0*s1).sum(1))/3.0)
    

    which with einsum becomes -

    np.sqrt((np.einsum('ij,ij->i',s0,s0)
             + np.einsum('ij,ij->i',s1,s1)
             - 2*np.einsum('ij,ij->i',s0,s1))/3.0)
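
The expanded form can be checked against the direct difference with a short sketch on random data (the arrays here are hypothetical). One caveat to assume: because the expansion subtracts quantities of similar size, rounding can leave a tiny negative value under the square root when two rows are nearly identical, so the sketch clips at zero as a precaution that is not in the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
s0 = rng.random((5, 3))         # hypothetical first block of columns
s1 = rng.random((5, 3))         # hypothetical second block of columns

# Direct form: subtract first, then square and average.
direct = np.sqrt(((s0 - s1) ** 2).mean(axis=1))

# Expanded form: a^2 + b^2 - 2ab, each term as a row-wise einsum.
sq = (np.einsum('ij,ij->i', s0, s0)
      + np.einsum('ij,ij->i', s1, s1)
      - 2 * np.einsum('ij,ij->i', s0, s1))

# Clip guards against small negative values caused by rounding (added assumption).
expanded = np.sqrt(np.clip(sq, 0, None) / 3.0)
```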
    

    Getting respective column indices

    If you are not sure whether the columns a, b, ... appear in that order, you can look up their positions with a helper such as column_index (a user-defined function, not part of pandas).

    Then a[:,0:3] would be replaced by a[:,column_index(df, ['a','b','c'])] and a[:,3:6] by a[:,column_index(df, ['d','e','y'])].
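
The answer does not show the body of `column_index`, so the following is only one possible sketch of such a helper, built on pandas' `Index.get_indexer`. The function body and the shuffled example DataFrame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def column_index(df, query_cols):
    """Hypothetical helper: integer positions of query_cols in df.columns."""
    idx = df.columns.get_indexer(query_cols)
    if (idx == -1).any():  # get_indexer marks missing labels with -1
        raise KeyError("some requested columns are missing")
    return idx

# Made-up frame whose columns are deliberately out of order.
df = pd.DataFrame({'y': [5.0], 'a': [1.0], 'e': [2.0],
                   'b': [2.0], 'd': [1.0], 'c': [3.0]})

a = df.to_numpy()
s0 = a[:, column_index(df, ['a', 'b', 'c'])]
s1 = a[:, column_index(df, ['d', 'e', 'y'])]
rmse_out = np.sqrt(((s0 - s1) ** 2).mean(axis=1))
```

Indexing with an integer array makes a copy of the selected columns rather than a view, which costs a little memory but keeps the result independent of the column order.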
