I am unable to understand the page of the StandardScaler in the documentation of sklearn.
Can anyone explain this to me in simple terms?
StandardScaler performs the task of standardization. Usually a dataset contains variables that are on different scales. For example, an Employee dataset might contain an AGE column with values in the range 20-70 and a SALARY column with values in the range 10000-80000.
Because these two columns differ in scale, they are standardized to a common scale when building a machine learning model.
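To make the "common scale" idea concrete, here is a minimal sketch. The AGE and SALARY values are made up for illustration; only `StandardScaler` and `fit_transform` come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical employee data: column 0 is AGE, column 1 is SALARY.
X = np.array([
    [25, 20000],
    [35, 40000],
    [45, 60000],
    [55, 80000],
], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learns per-column mean/std, then scales

print(X_scaled)               # both columns now use the same unitless scale
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

In this toy data both columns are evenly spaced, so after scaling they end up with identical values even though the raw numbers differ by three orders of magnitude.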
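How to calculate it: z = (x - mean) / std, where the mean and standard deviation are computed separately for each column.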
The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value 0 and standard deviation of 1.
In case of multivariate data, this is done feature-wise (in other words independently for each column of the data).
Given the distribution of the data, each value in the dataset will have the mean subtracted and will then be divided by the standard deviation of the whole dataset (or of the feature in the multivariate case).
This is useful when you want to compare data that correspond to different units. In that case, you want to remove the units. To do that consistently across all the data, you transform the data so that the variance is 1 and the mean of the series is 0.
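The feature-wise computation described above can be sketched as follows. The data is randomly generated purely for illustration; `mean_` and `scale_` are the fitted attributes in which scikit-learn stores the per-feature mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features with very different means and spreads.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(100, 2))

scaler = StandardScaler().fit(X)

# One mean and one standard deviation are learned per column (feature).
print(scaler.mean_)   # same as X.mean(axis=0)
print(scaler.scale_)  # same as X.std(axis=0), the population std (ddof=0)

# Subtract the column mean, then divide by the column standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(manual, scaler.transform(X)))
```

Note that each column is scaled using only its own statistics, so changing one feature does not affect how another is scaled.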
We apply StandardScaler() to every value (row) within a column, using that column's own mean and standard deviation.
So, for each row in a column (I am assuming that you are working with a Pandas DataFrame):
x_new = (x_original - mean_of_distribution) / std_of_distribution
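As a rough sketch of the formula above on a Pandas DataFrame (the column names and values are hypothetical): one detail worth knowing is that `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` uses the population standard deviation, so `ddof=0` is needed to reproduce its output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical DataFrame; the column names and values are made up.
df = pd.DataFrame({
    "age": [22.0, 35.0, 47.0, 58.0, 63.0],
    "salary": [15000.0, 32000.0, 48000.0, 61000.0, 79000.0],
})

# x_new = (x_original - mean_of_distribution) / std_of_distribution,
# applied to every row of each column with that column's statistics.
# ddof=0 matches the population std that StandardScaler uses.
df_manual = (df - df.mean()) / df.std(ddof=0)

df_sklearn = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=df.columns, index=df.index
)

print(df_manual)
print(np.allclose(df_manual.to_numpy(), df_sklearn.to_numpy()))
```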
A few points:
It is called StandardScaler because we divide by the standard deviation of the distribution (the distribution of the feature). Similarly, you can guess what MinMaxScaler() does.
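The shape of the original distribution remains the same after applying StandardScaler(). It is a common misconception that the distribution gets changed to a Normal Distribution. We are only shifting and rescaling the values so that the mean is 0 and the standard deviation is 1; the values are not confined to [0, 1] (that range is what MinMaxScaler() produces).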
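A quick way to check that the shape of the distribution is preserved is to compare the skewness before and after scaling. This is only a sketch: the exponential sample is arbitrary, and `skewness` is a small helper written here for illustration, not a scikit-learn function.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def skewness(a):
    """Population skewness: third central moment divided by std cubed."""
    a = np.asarray(a, dtype=float).ravel()
    centered = a - a.mean()
    return (centered ** 3).mean() / a.std() ** 3

# A clearly non-normal (right-skewed) sample, as a single feature column.
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=(1000, 1))

x_std = StandardScaler().fit_transform(x)
x_minmax = MinMaxScaler().fit_transform(x)

# Skewness is unchanged, so the data is just as non-normal as before.
print(skewness(x), skewness(x_std))

# Standardized values are not restricted to [0, 1] ...
print(x_std.min(), x_std.max())
# ... whereas MinMaxScaler maps the observed range onto [0, 1].
print(x_minmax.min(), x_minmax.max())
```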
After applying StandardScaler(), each column in X will have mean of 0 and standard deviation of 1.
Formulas are listed by others on this page.
Rationale: some algorithms require data to look like this (see sklearn docs).