Question
I found that scaling in SVM (Support Vector Machine) problems really improves its performance... I have read this explanation:
"The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges."
Unfortunately this didn't help me ... Can somebody provide me with a better explanation? Thank you in advance!
Answer 1:
The true reason for scaling features in SVM is that this classifier is not affine transformation invariant. In other words, if you multiply one feature by 1000, the solution given by the SVM will be completely different. It has nearly nothing to do with the underlying optimization techniques (although they are affected by these scale problems, they should still converge to the global optimum).
Consider an example: we have men and women, encoded by their sex and height (two features). Let us assume a very simple case with the following data:
0 -> man 1 -> woman
╔═════╦════════╗
║ sex ║ height ║
╠═════╬════════╣
║ 1 ║ 150 ║
╠═════╬════════╣
║ 1 ║ 160 ║
╠═════╬════════╣
║ 1 ║ 170 ║
╠═════╬════════╣
║ 0 ║ 180 ║
╠═════╬════════╣
║ 0 ║ 190 ║
╠═════╬════════╣
║ 0 ║ 200 ║
╚═════╩════════╝
And let us do something silly. Train it to predict the sex of the person, so we are trying to learn f(x,y)=x (ignoring the second feature).
It is easy to see that for such data the largest margin classifier will "cut" the plane horizontally somewhere around height "175", so once we get a new sample "1 178" (a woman of 178cm height) we get the classification that she is a man.
However, if we scale everything down to [0,1] we get something like
╔═════╦════════╗
║ sex ║ height ║
╠═════╬════════╣
║ 1 ║ 0.0 ║
╠═════╬════════╣
║ 1 ║ 0.2 ║
╠═════╬════════╣
║ 1 ║ 0.4 ║
╠═════╬════════╣
║ 0 ║ 0.6 ║
╠═════╬════════╣
║ 0 ║ 0.8 ║
╠═════╬════════╣
║ 0 ║ 1.0 ║
╚═════╩════════╝
and now the largest margin classifier "cuts" the plane nearly vertically (as expected), so given the new sample "1 178", which scales to around "1 0.56", we get that she is a woman (correct!).
So in general, scaling ensures that a feature is not used as the main predictor simply because its numeric values happen to be large.
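The original answer gives no code; below is a minimal sketch of the same toy example using scikit-learn's SVC and MinMaxScaler (my choice of library, assumed installed). It trains a linear SVM on the six rows above, once on the raw values and once on values scaled to [0,1], and then classifies the 178cm woman.

```python
# Minimal sketch of the toy example above (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Columns: [sex, height]; the label we try to learn is simply the sex column.
X = np.array([[1, 150], [1, 160], [1, 170],
              [0, 180], [0, 190], [0, 200]], dtype=float)
y = X[:, 0].astype(int)                      # f(x, y) = x

new_sample = np.array([[1.0, 178.0]])        # a woman (sex = 1) of 178 cm

# Unscaled: height dominates the margin, so the cut is made along height.
clf_raw = SVC(kernel="linear").fit(X, y)
print(clf_raw.predict(new_sample))           # expected: [0] -> classified as a man

# Scaled to [0, 1]: both features get comparable influence on the margin.
scaler = MinMaxScaler().fit(X)
clf_scaled = SVC(kernel="linear").fit(scaler.transform(X), y)
print(clf_scaled.predict(scaler.transform(new_sample)))   # expected: [1] -> a woman
```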
Answer 2:
Feature scaling is a general trick applied to optimization problems (not just SVM). The underlying algorithm for solving the SVM optimization problem is gradient descent. Andrew Ng has a great explanation of this in his Coursera videos.
I will illustrate the core idea here (borrowing Andrew's slides). Suppose you have only two parameters and one of them can take a relatively large range of values. Then the contour of the cost function can look like very tall and skinny ovals (see the blue ovals below). Your gradient steps (the gradient path is drawn in red) could take a long time, going back and forth, to find the optimal solution.
Instead, if you scale your features, the contour of the cost function might look like circles; then the gradient can take a much straighter path and reach the optimum much faster.
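The answer refers to slides that are not reproduced here, so the following is a rough numeric sketch of the same idea (my own construction, not Andrew Ng's): plain batch gradient descent on a least-squares cost where one feature's range is roughly 100 times the other's, compared with the same problem after z-score scaling. The data, learning rates, and tolerance are illustrative assumptions.

```python
# Rough illustration: gradient descent on unscaled vs. standardized features.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200),     # small-range feature
                     rng.uniform(0, 100, 200)])  # large-range feature
y = X @ np.array([3.0, 0.05]) + rng.normal(0, 0.1, 200)

def gradient_descent(X, y, lr, max_steps=10_000, tol=1e-8):
    """Batch gradient descent on mean squared error; returns steps until the update is tiny."""
    theta = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta_new = theta - lr * grad
        if np.linalg.norm(theta_new - theta) < tol:
            return step
        theta = theta_new
    return max_steps

# Z-score each column so both features have zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_ctr = y - y.mean()

# The unscaled problem needs a tiny learning rate to avoid divergence and
# typically exhausts the step budget; the scaled one tolerates a much larger
# rate and usually stops within a few hundred steps.
print("unscaled steps:", gradient_descent(X, y, lr=1e-4))
print("scaled steps:  ", gradient_descent(X_std, y_ctr, lr=0.1))
```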
Answer 3:
Just some personal thoughts from another perspective.
1. Why does feature scaling have an influence?
There is a saying in applied machine learning: "garbage in, garbage out". The more faithfully your features reflect the data, the more accurate your algorithm will be. The same applies to how machine learning algorithms treat the relationships between features. Unlike the human brain, when a machine learning algorithm performs classification, for example, all the features are expressed and computed in the same coordinate system, which in some sense imposes an a priori assumption about the relationships between features (not a true reflection of the data itself). Also, the nature of most algorithms is to find the weighting between features that best fits the data. So when the input to these algorithms consists of unscaled features, large-scale features have more influence on the weights, and that influence is not a reflection of the data itself.
2. Why does feature scaling usually improve accuracy?
A common practice in unsupervised machine learning, when selecting hyper-parameters (or hyper-hyper-parameters, e.g. in the hierarchical Dirichlet process or hLDA), is not to add any personal, subjective assumptions about the data; the best you can do is assume the options are equally probable. I think that applies here too. Feature scaling simply encodes the assumption that all features have an equal opportunity to influence the weights, which more faithfully reflects the information/knowledge you have about the data. It commonly results in better accuracy as well.
BTW, regarding affine transformation invariance and faster convergence, there is an interesting discussion on stats.stackexchange.com.
Answer 4:
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ descends quickly on small ranges and slowly on large ranges, and so oscillates inefficiently down to the optimum when the variables are very uneven. This is from Andrew Ng's Coursera course.
So, scaling amounts to something like standardizing the data. Sometimes researchers want to know whether a specific observation is common or exceptional, so they express a score in terms of the number of standard deviations it lies from the mean. This number is what we call a z-score. If we recode the original scores into z-scores, we say that we standardize a variable.
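As a small illustration of the z-score recoding described above (the example heights and the use of scikit-learn's StandardScaler are my own assumptions):

```python
# Z-score standardization: z = (x - mean) / standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler

heights = np.array([[150.0], [160.0], [170.0], [180.0], [190.0], [200.0]])

z_manual = (heights - heights.mean()) / heights.std()

# StandardScaler learns the mean/std on the training data so the same
# transformation can be reused on new observations.
scaler = StandardScaler().fit(heights)
print(np.allclose(z_manual, scaler.transform(heights)))  # True
print(scaler.transform([[178.0]]))                       # about 0.18 std above the mean
```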
Answer 5:
From what I have learnt in Andrew Ng's Coursera course, feature scaling helps gradient descent converge more quickly. If the data is more spread out, that is, if it has a higher standard deviation, gradient descent will take relatively longer to compute than it would if the data were first scaled.
Answer 6:
The idea of scaling is to remove excess computation on a particular variable by standardizing all the variables onto the same scale. This makes the slope (the m in y = mx + c) much easier to calculate, since the normalized parameter converges as quickly as possible.
Answer 7:
Yes, without normalisation the contours will be skinny; with normalisation:
- Values stay within a bounded range
- The calculation of theta speeds up, because fewer iterations are required
Source: https://stackoverflow.com/questions/26225344/why-feature-scaling