TensorFlow offers a nice LSTM wrapper:
rnn_cell.BasicLSTMCell(num_units, forget_bias=1.0, input_size=None,
                       state_is_tuple=False, activation=tanh)
I like to do the following. The only thing I know for sure is that some parameters are better left out of L2 regularization, such as batch norm parameters and biases. An LSTM contains a single Bias tensor (conceptually it has several biases, but they seem to be concatenated into one, presumably for performance), and for batch normalization I add "noreg" to the variables' names so that they are ignored too.
loss = ...  # your regular output loss
l2 = lambda_l2_reg * sum(
    tf.nn.l2_loss(tf_var)
    for tf_var in tf.trainable_variables()
    if not ("noreg" in tf_var.name or "Bias" in tf_var.name)
)
loss += l2
Where lambda_l2_reg is the small multiplier, e.g. float(0.005).
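To see which variables a filter like this keeps, here is a minimal framework-free sketch. The variable names and values are made up for illustration (they are not actual TensorFlow variable names), and the l2_loss helper mimics tf.nn.l2_loss, which computes sum(t ** 2) / 2.

```python
# Hypothetical stand-ins for tf.trainable_variables(): name -> flat list of values.
variables = {
    "lstm/Matrix": [1.0, -2.0, 3.0],
    "lstm/Bias": [0.5, 0.5],
    "bn_noreg/gamma": [1.0, 1.0],
    "dense/weights": [2.0, 2.0],
}


def l2_loss(values):
    """Same definition as tf.nn.l2_loss: sum(v ** 2) / 2."""
    return sum(v * v for v in values) / 2.0


def is_regularized(name):
    """Skip anything tagged "noreg" and anything with "Bias" in its name."""
    return not ("noreg" in name or "Bias" in name)


lambda_l2_reg = 0.005
regularized_names = [name for name in variables if is_regularized(name)]
l2 = lambda_l2_reg * sum(l2_loss(variables[name]) for name in regularized_names)

print(regularized_names)  # only the weight matrices remain
print(l2)
```

Note that the substring test is case-sensitive: a variable named something like "dense/biases" would not be skipped by the "Bias" check, so it is worth printing the kept names once to confirm the filter matches your own naming.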
Doing this selection (the full if clause in the loop, which leaves some variables out of the regularization) once made me jump from a 0.879 F1 score to 0.890 in a single test run, without readjusting the lambda value in the config. Note that this included both changes, for batch normalization and for the biases, and I had other biases in the neural network.
According to this paper, regularizing the recurrent weights may help with exploding gradients.
Also, according to this other paper, dropout would be better used between stacked cells and not inside cells if you use some.
About the exploding gradient problem: if you use gradient clipping on a loss that already has the L2 regularization added to it, the regularization term is taken into account during clipping too.
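The point about clipping can be illustrated with a small sketch in plain Python, assuming global-norm clipping (the rule used by tf.clip_by_global_norm) and made-up numbers. Since the L2 term contributes lambda * w to each weight's gradient, it is part of the norm that gets clipped.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale grads so that their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


# Made-up values for illustration only.
weights = [3.0, -4.0]
data_grads = [3.0, 4.0]  # gradient of the output loss w.r.t. the weights
lambda_l2_reg = 0.1

# The derivative of lambda * sum(w ** 2) / 2 with respect to w is lambda * w.
l2_grads = [lambda_l2_reg * w for w in weights]
total_grads = [d + r for d, r in zip(data_grads, l2_grads)]

# Clipping sees the sum of both gradient terms, not the data term alone.
clipped, norm_before = clip_by_global_norm(total_grads, clip_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(norm_before, norm_after)
```

If you wanted the clipping to ignore the regularization, you would have to clip the gradients of the output loss alone and add the L2 gradients afterwards, which is a different design from the one described here.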
P.S. Here is the neural network I was working on: https://github.com/guillaume-chevalier/HAR-stacked-residual-bidir-LSTMs