What is the intuition behind using tanh in an LSTM?
In an LSTM network ( Understanding LSTMs ), why do the input gate and output gate use tanh? What is the intuition behind this? Is it just a nonlinear transformation? If so, can I change both to another activation function (e.g. ReLU)?

The sigmoid, specifically, is used as the gating function for the three gates (input, output, forget) in an LSTM: since it outputs a value between 0 and 1, it can allow either no flow or complete flow of information through the gates. To overcome the vanishing gradient problem, on the other hand, we need a function whose second derivative can sustain for a long range before going to zero.
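To make the roles of the two nonlinearities concrete, here is a minimal NumPy sketch of a single LSTM cell step. The parameter names, stacking order, and shapes are illustrative assumptions rather than anything specified in the original post; the point is only to show where sigmoid (gating) and tanh (value squashing) enter the standard cell equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative parameterization).

    W, U, b hold stacked parameters for the input, forget, and output
    gates plus the candidate cell values, in that order.
    Assumed shapes: x (d,), h_prev/c_prev (n,), W (4n, d), U (4n, n), b (4n,).
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # pre-activations for all four parts

    i = sigmoid(z[0*n:1*n])             # input gate: in [0, 1], how much new info to write
    f = sigmoid(z[1*n:2*n])             # forget gate: in [0, 1], how much old state to keep
    o = sigmoid(z[2*n:3*n])             # output gate: in [0, 1], how much state to expose
    g = np.tanh(z[3*n:4*n])             # candidate values: in [-1, 1], zero-centered

    c = f * c_prev + i * g              # new cell state (gated blend of old and new)
    h = o * np.tanh(c)                  # tanh squashes the cell state before it is output
    return h, c
```

In this sketch, replacing the two `np.tanh` calls with ReLU would leave the cell state unbounded and no longer zero-centered, which is the behavior the question is asking about.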