Question
I just implemented AdaDelta (http://arxiv.org/abs/1212.5701) for my own deep neural network library. The paper more or less claims that SGD with AdaDelta is not sensitive to hyperparameters and that it always converges to somewhere good (at least, the output reconstruction loss of AdaDelta-SGD is comparable to that of a well-tuned Momentum method).
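For reference, here is a minimal sketch of the AdaDelta update rule as described in the paper (running averages of squared gradients and squared updates, with decay rate rho and conditioning constant eps). The function name and NumPy-array interface are just illustrative, not from the question's own library:

```python
import numpy as np

def adadelta_update(param, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta step for a single parameter array (sketch, per Zeiler 2012)."""
    Eg2, Edx2 = state                                        # running E[g^2] and E[dx^2]
    Eg2 = rho * Eg2 + (1.0 - rho) * grad ** 2                # accumulate squared gradient
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad    # RMS-ratio scaled step
    Edx2 = rho * Edx2 + (1.0 - rho) * dx ** 2                # accumulate squared update
    return param + dx, (Eg2, Edx2)
```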
When I used AdaDelta-SGD as the learning method in a Denoising AutoEncoder, it converged in some specific settings, but not always. When I used MSE as the loss function and Sigmoid as the activation function, it converged very quickly, and after 100 epochs the final reconstruction loss was better than that of plain SGD, SGD with Momentum, and AdaGrad.
But when I used ReLU as the activation function, it didn't converge; it stayed stuck (oscillating) at a high (bad) reconstruction loss, much like plain SGD with a very high learning rate. The reconstruction loss it got stuck at was about 10 to 20 times higher than the final reconstruction loss obtained with the Momentum method.
I really don't understand why this happened, since the paper says AdaDelta just works. Please let me know the reason behind this behavior and how I can avoid it.
Answer 1:
The activation of a ReLU is unbounded, which makes it a poor fit for auto encoders: your training vectors are bounded, so they do not have the arbitrarily large responses a ReLU output can produce. ReLU simply isn't a good fit for that type of network.
You can force a ReLU into an auto encoder by applying some transformation to the output layer, as is done here. However, they don't discuss the quality of the results as an auto-encoder, only as a pre-training method for classification. So it's not clear that it's a worthwhile endeavor for building an auto encoder either.
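The answer doesn't spell out the transformation, but one common workaround (illustrative only, not necessarily what the linked work does) is to keep ReLU in the hidden layer and put a bounded activation such as a sigmoid on the output layer, so the reconstruction stays in the same [0, 1] range as normalized inputs:

```python
import numpy as np

def dae_forward(x_noisy, W1, b1, W2, b2):
    """Sketch of a denoising-autoencoder forward pass: ReLU hidden, sigmoid output."""
    h = np.maximum(0.0, x_noisy @ W1 + b1)        # unbounded ReLU features
    y = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # bounded reconstruction in [0, 1]
    return y

def mse_loss(y, x_clean):
    """MSE against the clean (un-noised) input, as in the question's setup."""
    return np.mean((y - x_clean) ** 2)
```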
Source: https://stackoverflow.com/questions/24830365/ada-delta-method-doesnt-converge-when-used-in-denoising-autoencoder-with-mse-lo