Question
From the dropout paper:
"The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time."
Why do we want to preserve the expected output? If we use ReLU activations, linear scaling of weights or activations results in linear scaling of network outputs and does not have any effect on the classification accuracy.
What am I missing?
Answer 1:
To be precise, we want to preserve not the "expected output" but the expected value of the output; that is, we want to compensate for the difference between the training phase (where the values of some nodes are not passed on) and the test phase by keeping the mean (expected) value of each unit's output the same.
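A minimal numpy sketch of that expectation argument, using made-up values for the retain probability `p`, the activation `a`, and the outgoing weight `w` (none of these come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.8   # retain probability (assumed value)
a = 1.5   # activation of a hidden unit (assumed value)
w = 0.6   # outgoing weight learned during training (assumed value)

# Training: the unit's contribution downstream is w * a when retained, 0 when dropped.
retained = rng.random(1_000_000) < p
train_contrib = w * a * retained
print("mean contribution during training:", train_contrib.mean())  # ~ p * w * a = 0.72

# Test time, as in the paper: no dropout, but the outgoing weight is scaled by p.
print("contribution at test time:       ", (p * w) * a)            # exactly p * w * a = 0.72
```

The Monte Carlo mean of the dropped-out contribution matches the scaled test-time contribution, which is exactly the equality the quoted paragraph describes.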
In the case of ReLU activations this scaling indeed leads to a linear scaling of the outputs (when they are positive), but why do you think it doesn't affect the final accuracy of a classification model? At the end of the network we usually apply either a softmax or a sigmoid, which are non-linear and therefore depend on this scaling.
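A small sketch of that last point, with hypothetical logits, showing that the softmax output is not invariant to a uniform scaling of its inputs:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0])     # hypothetical pre-softmax scores

print(softmax(logits))            # ~ [0.731, 0.269]
print(softmax(0.5 * logits))      # ~ [0.622, 0.378]  -- different probabilities
```

The predicted probabilities change with the scale, so the scale that reaches the final non-linearity does matter for the model's outputs.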
Source: https://stackoverflow.com/questions/53689156/why-do-we-want-to-scale-outputs-when-using-dropout