Is the Keras implementation of dropout correct?

Posted by 僤鯓⒐⒋嵵緔 on 2019-12-01 03:29:36

Yes, it is implemented properly. Since Dropout was invented, people have also improved it from the implementation point of view, and Keras uses one of these techniques. It is called inverted dropout, and you may read about it here.
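As a rough illustration of what "inverted dropout" means, here is a minimal NumPy sketch. It is not Keras's actual code. The helper name `inverted_dropout` is hypothetical, and `rate` is assumed to be the fraction of units dropped (the convention of Keras's `Dropout(rate)`), so the retain probability is `1 - rate`.

```python
import numpy as np

def inverted_dropout(x, rate, training, rng):
    """Hypothetical helper: `rate` is the fraction of units to drop."""
    if not training or rate == 0.0:
        # At test / eval time the input passes through unchanged.
        return x
    keep_prob = 1.0 - rate
    # Keep each unit independently with probability keep_prob.
    mask = rng.random(x.shape) < keep_prob
    # Scale the surviving units up by 1 / keep_prob at training time.
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((2, 5))
train_out = inverted_dropout(x, rate=0.5, training=True, rng=rng)
test_out = inverted_dropout(x, rate=0.5, training=False, rng=rng)
```

With `rate=0.5`, every entry of `train_out` is either 0 (dropped) or 2 (kept and rescaled), while `test_out` is identical to `x`.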

UPDATE:

To be honest - in the strict mathematical sense these two approaches are not equivalent. In the inverted case you multiply every hidden activation by the reciprocal of the dropout parameter (the retain probability p in the notation of the paper quoted below). Because differentiation is linear, this is equivalent to multiplying every gradient that flows back through the layer by the same factor. To compensate for this difference you would have to set a different learning rate. From this point of view the approaches differ. But from a practical point of view they are equivalent because:

  1. If you use a method which automatically sets the learning rate (like RMSProp or Adagrad) - it will make almost no change to the algorithm.
  2. If you use a method where you set the learning rate manually - you must take into account the stochastic nature of dropout: because some neurons are turned off during the training phase (which does not happen during the test / evaluation phase), you must rescale your learning rate to compensate for this difference. Probability theory gives us the best rescaling factor: the reciprocal of the dropout parameter, which makes the expected value of the loss function gradient length the same in both the train and test / eval phases.

Of course - both points above are about the inverted dropout technique.
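The gradient-scaling argument can be checked on the backward pass of a single dropout layer. The sketch below is only an illustration under stated assumptions: `dropout_backward` is a hypothetical helper, `p` is the retain probability, and the same mask is used for both variants so that the only difference is the 1/p factor.

```python
import numpy as np

def dropout_backward(grad_out, mask, p, inverted):
    """Hypothetical backward pass: gradient w.r.t. the layer input."""
    # Dropped units receive no gradient in either variant.
    grad_in = grad_out * mask
    # Inverted dropout multiplied the activations by 1/p in the forward
    # pass, so by linearity the gradient carries the same 1/p factor.
    return grad_in / p if inverted else grad_in

rng = np.random.default_rng(0)
p = 0.8                                        # retain probability (assumed)
grad_out = rng.normal(size=(4, 5))             # upstream gradient
mask = (rng.random(grad_out.shape) < p).astype(grad_out.dtype)

g_standard = dropout_backward(grad_out, mask, p, inverted=False)
g_inverted = dropout_backward(grad_out, mask, p, inverted=True)
```

Here `g_inverted` equals `g_standard / p` elementwise, which is why the difference between the two variants can be absorbed into the learning rate.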

Excerpted from the original Dropout paper (Section 10):

In this paper, we described dropout as a method where we retain units with probability p at training time and scale down the weights by multiplying them by a factor of p at test time. Another way to achieve the same effect is to scale up the retained activations by multiplying by 1/p at training time and not modifying the weights at test time. These methods are equivalent with appropriate scaling of the learning rate and weight initializations at each layer.
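To make the quoted equivalence concrete, the following sketch computes the exact expected training-time output of one linear unit by enumerating every dropout mask, and compares it with the test-time output for both schemes. The values of `p`, `x`, and `w` are arbitrary assumptions chosen for illustration, and `expected_train_output` is a hypothetical helper.

```python
import itertools
import numpy as np

p = 0.8                                  # retain probability, as in the paper
x = np.array([1.0, -2.0, 3.0, 0.5])      # inputs to one linear unit (assumed)
w = np.array([0.2, 0.4, -0.1, 0.3])      # its weights (assumed)

def expected_train_output(scale):
    """Exact expectation of w . (mask * x * scale) over all 2^n masks."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=x.size):
        mask = np.array(bits, dtype=float)
        # Probability of this particular mask under independent Bernoulli(p).
        prob = np.prod(np.where(mask == 1, p, 1 - p))
        total += prob * (w @ (mask * x * scale))
    return total

# Standard dropout: no scaling at train time, weights multiplied by p at test time.
std_train = expected_train_output(scale=1.0)
std_test = (p * w) @ x

# Inverted dropout: activations multiplied by 1/p at train time, weights untouched at test time.
inv_train = expected_train_output(scale=1.0 / p)
inv_test = w @ x
```

In each scheme the expected training output matches the test output, and the two schemes differ from each other only by the constant factor p, which is the scaling the paper says must be absorbed by the learning rate and weight initialization.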
