Why use softmax as opposed to standard normalization?

Asked by 一整个雨季 on 2020-12-02 03:43

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:

p_i = e^{q_i} / Σ_j e^{q_j}

where the q_i are the raw scores produced by the network.
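For concreteness, here is a minimal NumPy sketch (not part of the original question; the score vector q is made up) contrasting softmax with plain divide-by-the-sum normalization:

```python
import numpy as np

def softmax(q):
    """Exponentiate the scores, then normalize them to sum to 1."""
    e = np.exp(q - np.max(q))   # subtract the max for numerical stability
    return e / e.sum()

def standard_normalize(q):
    """Divide the raw scores by their sum (only meaningful if all scores are positive)."""
    return q / q.sum()

q = np.array([1.0, 2.0, 3.0])     # hypothetical network outputs
print(softmax(q))                 # [0.090 0.245 0.665] -- exaggerates the largest score
print(standard_normalize(q))      # [0.167 0.333 0.500] -- linear scaling, fails for negative scores
```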

9 Answers
  • 2020-12-02 04:20

    I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.

    On the surface, the softmax looks like a simple non-linear normalization (we spread the data out with an exponential). However, there is more to it than that.

    Specifically, there are a couple of different views (same link as above):

    1. Information Theory - from the perspective of information theory the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.

    2. Probabilistic View - from this perspective we are in fact looking at log-probabilities, so when we exponentiate them we end up with the raw probabilities. In this case the softmax equation finds the MLE (Maximum Likelihood Estimate).

    In summary, even though the softmax equation might seem arbitrary, it is NOT. It is actually a rather principled way of normalizing the classifications so as to minimize the cross-entropy/negative log-likelihood between the predictions and the truth.
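    To make the cross-entropy/MLE point concrete, here is a small sketch of my own (not from the linked notes; the scores and one-hot target are made up) computing the softmax cross-entropy loss:

    ```python
    import numpy as np

    def softmax(q):
        e = np.exp(q - np.max(q))   # numerically stable softmax
        return e / e.sum()

    def cross_entropy(p_pred, p_true):
        """Cross-entropy between the true distribution and the prediction."""
        return -np.sum(p_true * np.log(p_pred))

    q = np.array([2.0, 1.0, 0.1])       # hypothetical class scores
    target = np.array([1.0, 0.0, 0.0])  # one-hot ground truth

    p = softmax(q)
    loss = cross_entropy(p, target)     # equals -log p[0]; minimizing it maximizes the likelihood
    print(p, loss)                      # [0.659 0.242 0.099] 0.417...
    ```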

  • 2020-12-02 04:20

    The values of q_i represent log-likelihoods. In order to recover the probability values, you need to exponentiate them.

    One reason that statistical algorithms often use log-likelihood loss functions is numerical stability: a product of probabilities may be represented by a very small floating-point number. With a log-likelihood loss function, a product of probabilities becomes a sum of logs.
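    A quick illustration of that stability point (my own sketch, with made-up numbers): a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays perfectly representable:

    ```python
    import numpy as np

    probs = np.full(1000, 1e-4)      # 1000 hypothetical small probabilities

    product = np.prod(probs)         # 1e-4000 underflows to 0.0 in float64
    log_sum = np.sum(np.log(probs))  # about -9210.34, no underflow

    print(product, log_sum)
    ```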

    Another reason is that log-likelihoods occur naturally when deriving estimators for random variables that are assumed to be drawn from multivariate Gaussian distributions. See for example the Maximum Likelihood (ML) estimator and the way it is connected to least squares.
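    To see that connection in one line (a standard textbook derivation, summarized here rather than taken from the answer): if observations y_i are modeled as Gaussian with predicted mean ŷ_i and fixed variance σ², then

    $$
    \log \prod_i \mathcal{N}(y_i \mid \hat{y}_i, \sigma^2)
    = -\frac{1}{2\sigma^2} \sum_i (y_i - \hat{y}_i)^2 + \text{const},
    $$

    so maximizing the log-likelihood over the predictions is exactly the least-squares problem.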

    As a sidenote, I think that this question is more appropriate for the CS Theory or Computational Science Stack Exchanges.

  • 2020-12-02 04:30

    Suppose we change the softmax function so that the output activations are given by

    a^L_j = e^{c·z^L_j} / Σ_k e^{c·z^L_k}

    where c is a positive constant. Note that c=1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c→∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).
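    To see that limiting behavior numerically, here is a small sketch of my own (the activation vector z is made up) implementing the generalized softmax with the constant c:

    ```python
    import numpy as np

    def softmax_c(z, c=1.0):
        """Generalized softmax e^{c z_j} / sum_k e^{c z_k}; c=1 is the usual softmax."""
        e = np.exp(c * (z - np.max(z)))   # shift by the max for numerical stability
        return e / e.sum()

    z = np.array([1.0, 2.0, 3.0])         # hypothetical output activations z^L_j
    for c in (1, 10, 100):
        print(c, softmax_c(z, c))
    # c=1   -> [0.090 0.245 0.665]
    # c=10  -> [2.1e-09 4.5e-05 9.99955e-01]
    # c=100 -> [1.4e-87 3.7e-44 1.0e+00]  (essentially all mass on the largest activation: the "hard" maximum)
    ```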
