Question
From the dropout paper:
"The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time."
Why do we want to preserve the expected output? If we use ReLU activations, linear scaling of weights or activations results in linear scaling of network outputs and does not have any effect on the classification accuracy.
What am I missing?
Answer 1:
To be precise, we want to preserve not the "expected output" but the expected value of the output; that is, we want to compensate for the difference between the training phase (where the values of some nodes are not passed on) and the test phase by keeping the mean (expected) value of each unit's output the same.
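A minimal numpy sketch of that expectation argument, using made-up values for the retain probability `p`, the activation `a`, and the outgoing weight `w` (none of these come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.8   # retain probability (assumed value)
a = 1.5   # activation of a hidden unit (assumed value)
w = 0.6   # outgoing weight learned during training (assumed value)

# Training: the unit's contribution downstream is w * a when retained, 0 when dropped.
retained = rng.random(1_000_000) < p
train_contrib = w * a * retained
print("mean contribution during training:", train_contrib.mean())  # ~ p * w * a = 0.72

# Test time, as in the paper: no dropout, but the outgoing weight is scaled by p.
print("contribution at test time:       ", (p * w) * a)            # exactly p * w * a = 0.72
```

The Monte Carlo mean of the dropped-out contribution matches the scaled test-time contribution, which is exactly the equality the quoted paragraph describes.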
In the case of ReLU activations this scaling indeed leads to a linear scaling of the outputs (when they are positive), but why do you think it doesn't affect the final accuracy of a classification model? At the end of the network we usually apply either a softmax or a sigmoid, which are non-linear and therefore depend on this scaling.
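A small sketch of that last point, with hypothetical logits, showing that the softmax output is not invariant to a uniform scaling of its inputs:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0])     # hypothetical pre-softmax scores

print(softmax(logits))            # ~ [0.731, 0.269]
print(softmax(0.5 * logits))      # ~ [0.622, 0.378]  -- different probabilities
```

The predicted probabilities change with the scale, so the scale that reaches the final non-linearity does matter for the model's outputs.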
Source: https://stackoverflow.com/questions/53689156/why-do-we-want-to-scale-outputs-when-using-dropout