Comparing Results from StandardScaler vs Normalizer in Linear Regression

无人及你  2021-02-18 21:40

I'm working through some examples of Linear Regression under different scenarios, comparing the results from using Normalizer and StandardScaler, and ...

3 Answers
  •  执念已碎
    2021-02-18 22:20

    1. The reason there is no difference in coefficients between the first two models is that sklearn de-normalizes the coefficients behind the scenes after calculating them from the normalized input data. Reference

    This de-normalization is done so that, for test data, we can apply the coefficients directly and get a prediction without normalizing the test data.

    Hence, setting normalize=True does have an impact on the coefficients, but it does not affect the best-fit line, as the sketch below shows.
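
    A quick check (a minimal sketch with made-up data; note that the normalize parameter of LinearRegression was deprecated in scikit-learn 1.0 and removed in 1.2, so this only runs on older versions):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.normal(0, 2, size=(50, 2))                   # made-up features
    y = 5 + X @ np.array([0.5, 2.5]) + rng.normal(0, 1, 50)

    plain = LinearRegression().fit(X, y)
    normed = LinearRegression(normalize=True).fit(X, y)  # removed in sklearn 1.2

    # Same coefficients and intercept, because sklearn de-normalizes them
    print(plain.coef_, plain.intercept_)
    print(normed.coef_, normed.intercept_)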

    2. Normalizer does the normalization with respect to each sample (meaning row-wise). You can see the reference code here.

    From documentation:

    Normalize samples individually to unit norm.

    whereas normalize=True does the normalization with respect to each column/feature. Reference
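
    A tiny numeric sketch of the difference, using a made-up 2×2 matrix:

    import numpy as np
    from sklearn.preprocessing import Normalizer, normalize

    X = np.array([[3.0, 4.0],
                  [1.0, 2.0]])

    # Row-wise: each row is scaled to unit L2 norm (what Normalizer does)
    print(Normalizer().fit_transform(X))   # [[0.6, 0.8], [0.447..., 0.894...]]

    # Column-wise: each column is scaled to unit L2 norm (what normalize=True does)
    print(normalize(X, axis=0))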

    Here is an example to understand the impact of normalization along different dimensions of the data. Let us take two dimensions x1 & x2, and let y be the target variable. The target variable's value is color-coded in the figure below.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import Normalizer, StandardScaler, normalize

    # Two random features and a linear target with noise
    n = 50
    x1 = np.random.normal(0, 2, size=n)
    x2 = np.random.normal(0, 2, size=n)
    noise = np.random.normal(0, 1, size=n)
    y = 5 + 0.5*x1 + 2.5*x2 + noise

    fig, ax = plt.subplots(1, 4, figsize=(20, 6))

    ax[0].scatter(x1, x2, c=y)
    ax[0].set_title('raw_data', size=15)

    X = np.column_stack((x1, x2))

    # Column-wise normalization (per feature), as with normalize=True
    column_normalized = normalize(X, axis=0)
    ax[1].scatter(column_normalized[:, 0], column_normalized[:, 1], c=y)
    ax[1].set_title('column_normalized data', size=15)

    # Row-wise normalization (per sample), as with Normalizer
    row_normalized = Normalizer().fit_transform(X)
    ax[2].scatter(row_normalized[:, 0], row_normalized[:, 1], c=y)
    ax[2].set_title('row_normalized data', size=15)

    # Standardization: zero mean and unit variance per feature
    standardized_data = StandardScaler().fit_transform(X)
    ax[3].scatter(standardized_data[:, 0], standardized_data[:, 1], c=y)
    ax[3].set_title('standardized data', size=15)

    plt.subplots_adjust(left=0.3, right=0.9, wspace=0.3)
    plt.show()


    You can see that the best-fit line for the data in figs 1, 2 and 4 would be the same, which signifies that the R²-score will not change due to column/feature normalization or standardizing the data; it just ends up with different coefficient values.

    Note: the best-fit line for fig 3 would be different, as the check below confirms.
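
    A minimal sketch to verify this, continuing with the arrays from the snippet above:

    from sklearn.linear_model import LinearRegression

    # R^2 is unchanged by column normalization or standardization,
    # but changes under row-wise normalization
    for name, data in [('raw', X),
                       ('column_normalized', column_normalized),
                       ('row_normalized', row_normalized),
                       ('standardized', standardized_data)]:
        r2 = LinearRegression().fit(data, y).score(data, y)
        print(name, round(r2, 4))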

    3. When you set fit_intercept=False, the bias term is dropped from the prediction. That is, the intercept is fixed at zero; with zero-mean features it would otherwise equal the mean of the target variable.

    A prediction with the intercept forced to zero can be expected to perform badly on problems where the target variable is not scaled (mean = 0). The constant difference of 22.532 you see in every row is exactly this missing intercept, which shows its impact on the output. The sketch below reproduces the effect.
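
    A minimal sketch with made-up data (features standardized to exactly zero mean, so the two fits share the same coefficients and differ only by the intercept):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = StandardScaler().fit_transform(rng.normal(0, 2, size=(50, 2)))
    y = 22.5 + X @ np.array([0.5, 2.5]) + rng.normal(0, 1, 50)

    with_b = LinearRegression(fit_intercept=True).fit(X, y)
    no_b = LinearRegression(fit_intercept=False).fit(X, y)

    # With zero-mean features, the coefficients are identical and every
    # prediction differs by the intercept, i.e. the mean of y
    print(with_b.intercept_, y.mean())
    print(np.ptp(with_b.predict(X) - no_b.predict(X)))  # ~0: a constant shift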
