Question
I am building an RNN for classification (there is a softmax layer after the RNN). There are many options for what to regularize, and I am not sure whether to just try all of them. Would the effect be the same? Which components should I regularize, and in which situations?
The components being:
- Kernel weights (layer input)
- Recurrent weights
- Bias
- Activation function (layer output)
Answer 1:
Which regularizers work best will depend on your specific architecture, data, and problem; as usual, there isn't a single cut to rule them all, but there are do's and (especially) don'ts, as well as systematic means of determining what will work best - via careful introspection and evaluation.
How does RNN regularization work?
Perhaps the best approach to understanding it is information-based. First, see "How does 'learning' work?" and "RNN: Depth vs. Width". To understand RNN regularization, one must understand how RNN handles information and learns, which the referred sections describe (though not exhaustively). Now to answer the question:
RNN regularization's goal is any regularization's goal: maximizing information utility and traversal of the test loss function. The specific methods, however, tend to differ substantially for RNNs per their recurrent nature - and some work better than others; see below.
RNN regularization methods:
Weight Decay:
General: shrinks the norm ('average') of the weight matrix
- Linearization, depending on activation; e.g. sigmoid, tanh, but less so relu
- Gradient boost, depending on activation; e.g. sigmoid, tanh grads flatten out for large activations - linearizing enables neurons to keep learning
Recurrent weights: default activation='sigmoid'
- Pros: linearizing can help BPTT (remedy vanishing gradient), hence also learning long-term dependencies, as recurrent information utility is increased
- Cons: linearizing can harm representational power - however, this can be offset by stacking RNNs
Kernel weights: for many-to-one (return_sequences=False), they work similarly to weight decay on a typical layer (e.g. Dense). For many-to-many (return_sequences=True), however, kernel weights operate on every timestep, so pros & cons similar to above will apply.
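The linearization and gradient-boost effects listed under weight decay can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the inputs and weights are random, and the 0.1 scaling factor simply stands in for whatever shrinkage weight decay would produce during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer inputs and weights (not taken from any real model).
x = rng.normal(size=(1000, 64))
w_large = rng.normal(size=(64, 32))
w_small = 0.1 * w_large  # same directions, smaller norm, as if decayed


def tanh_grad(z):
    """Derivative of tanh with respect to its pre-activation z."""
    return 1.0 - np.tanh(z) ** 2


def nonlinearity_gap(z):
    """Mean absolute distance between tanh(z) and the identity map."""
    return np.abs(np.tanh(z) - z).mean()


z_large, z_small = x @ w_large, x @ w_small

# Smaller weights -> smaller pre-activations -> tanh operates near its linear
# region, where its gradient is close to 1 instead of close to 0.
print("mean tanh gradient:", tanh_grad(z_large).mean(), "->", tanh_grad(z_small).mean())
print("distance to linear:", nonlinearity_gap(z_large), "->", nonlinearity_gap(z_small))
```

The same comparison with relu would show much less change, since relu is already piecewise linear.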
Dropout:
- Activations (kernel): can benefit, but only if limited; values are usually kept less than 0.2 in practice. Problem: tends to introduce too much noise, and erase important context information, especially in problems w/ limited timesteps.
- Recurrent activations (recurrent_dropout): the recommended dropout
Batch Normalization:
- Activations (kernel): worth trying. Can benefit substantially, or not.
- Recurrent activations: should work better; see Recurrent Batch Normalization. No Keras implementations yet as far as I know, but I may implement it in the future.
Weight Constraints: set a hard upper bound on the weights' l2-norm; a possible alternative to weight decay.
Activity Constraints: don't bother; for most purposes, if you have to manually constrain your outputs, the layer itself is probably learning poorly, and the solution is elsewhere.
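To make the weight-constraint idea concrete, here is a minimal NumPy sketch of a max-norm projection: any column whose l2 norm exceeds the bound is rescaled back onto it, and its direction is left alone. The function name apply_max_norm, the bound of 2.0, and the random weight matrix are all hypothetical; this is an illustration of the idea, not Keras's own code.

```python
import numpy as np


def apply_max_norm(w, max_value=2.0, axis=0, eps=1e-7):
    """Rescale each slice along `axis` so its l2 norm is at most max_value."""
    norms = np.sqrt(np.sum(np.square(w), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    # Slices already under the bound are scaled by ~1, i.e. left (almost) unchanged.
    return w * (desired / (eps + norms))


rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8))  # hypothetical (input_dim, units) weight matrix
w_constrained = apply_max_norm(w, max_value=2.0)

col_norms_before = np.linalg.norm(w, axis=0)
col_norms_after = np.linalg.norm(w_constrained, axis=0)
print("max column norm:", col_norms_before.max(), "->", col_norms_after.max())
```

Unlike weight decay, which pulls every weight toward zero on every step, the projection only acts on columns that have grown past the bound.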
What should I do? Lots of info - so here's some concrete advice:
- Weight decay: try 1e-3, 1e-4, see which works better. Do not expect the same value of decay to work for kernel and recurrent_kernel, especially depending on architecture. Check weight shapes - if one is much smaller than the other, apply smaller decay to the former
- Dropout: try 0.1. If you see improvement, try 0.2 - else, scrap it
- Recurrent Dropout: start with 0.2. Improvement --> 0.4. Improvement --> 0.5, else 0.3.
- Batch Normalization: try. Improvement --> keep it - else, scrap it.
- Recurrent Batchnorm: same as Batch Normalization.
- Weight constraints: advisable w/ higher learning rates to prevent exploding gradients - else use higher weight decay
- Activity constraints: probably not (see above)
- Residual RNNs: introduce significant changes, along with a regularizing effect. See application in IndRNNs
- Biases: put simply, I don't know. No one seems to bother with them, so I haven't experimented much either. With BatchNormalization, however, you can set use_bias=False
- Zoneout: don't know, never tried, might work - see paper.
- Layer Normalization: some report it working better than BN for RNNs - but my application found it otherwise; paper
- Data shuffling: is a strong regularizer. Also shuffle batch samples (samples in batch). See relevant info on stateful RNNs
- Optimizer: can be an inherent regularizer. Don't have a full explanation, but in my application, Nadam (& NadamW) has stomped every other optimizer - worth trying.
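As a starting point, the values suggested in the list above could be wired into a Keras LSTM roughly as follows. This is a sketch under stated assumptions, not a recommended configuration: the layer size, input shape, number of classes, and max-norm bound are hypothetical, and which of kernel or recurrent decay should be larger is something to determine experimentally.

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2

lstm = tf.keras.layers.LSTM(
    32,                               # hypothetical number of units
    kernel_regularizer=l2(1e-4),      # weight decay on input (kernel) weights
    recurrent_regularizer=l2(1e-3),   # separate decay value for recurrent weights
    dropout=0.1,                      # input dropout: keep it small
    recurrent_dropout=0.2,            # recurrent dropout: suggested starting value
    kernel_constraint=tf.keras.constraints.MaxNorm(3.0),     # hypothetical bound
    recurrent_constraint=tf.keras.constraints.MaxNorm(3.0),  # hypothetical bound
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 8)),                    # hypothetical (timesteps, features)
    lstm,
    tf.keras.layers.Dense(4, activation="softmax"),   # hypothetical 4 classes
])
model.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")
model.summary()
```

Changing one argument at a time, and inspecting weights and activations after each change, makes it easier to attribute any improvement to a specific regularizer.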
Introspection: bottom section on 'learning' isn't worth much without this; don't just look at validation performance and call it a day - inspect the effect that adjusting a regularizer has on weights and activations. Evaluate using info toward bottom & relevant theory.
BONUS: weight decay can be powerful - even more powerful when done right; turns out, adaptive optimizers like Adam can harm its effectiveness, as described in this paper. Solution: use AdamW. My Keras/TensorFlow implementation here.
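Why Adam can blunt weight decay is easiest to see on a single update step. The sketch below is a simplified NumPy illustration, not the linked implementation: adam_step covers only the first step from zero-initialized moments, and the weights, gradients, learning rate, and decay value are hypothetical.

```python
import numpy as np


def adam_step(w, grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """First Adam step (t = 1), starting from zero-initialized moment estimates."""
    m = (1 - b1) * grad
    v = (1 - b2) * grad ** 2
    m_hat = m / (1 - b1)  # bias correction at t = 1
    v_hat = v / (1 - b2)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)


lr, wd = 1e-3, 1e-2
w = np.array([1.0, 1.0])
loss_grad = np.array([10.0, 0.1])  # hypothetical: one large, one small gradient

w_no_decay = adam_step(w, loss_grad, lr=lr)
# (a) L2 penalty folded into the gradient: it is rescaled by Adam's normalization
w_l2 = adam_step(w, loss_grad + wd * w, lr=lr)
# (b) decoupled decay (AdamW-style): applied directly to the weights
w_decoupled = adam_step(w, loss_grad, lr=lr) - lr * wd * w

print("shrinkage from L2-in-gradient:", w_no_decay - w_l2)
print("shrinkage from decoupled decay:", w_no_decay - w_decoupled)
```

In this toy step the penalty added to the gradient is almost entirely normalized away, while the decoupled version shrinks both weights by lr * wd * w regardless of gradient magnitude.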
This is too much! Agreed - welcome to Deep Learning. Two tips here:
- Bayesian Optimization; will save you time especially on prohibitively expensive training.
- Conv1D(strides > 1), for many timesteps (>1000); slashes dimensionality, shouldn't harm performance (may in fact improve it).
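To get a feel for how much a strided Conv1D shortens the sequence fed to the RNN, the output length can be computed by hand. The helper below is a hypothetical convenience function using the standard formulas for 'valid' and 'same' padding without dilation; the sequence length, kernel size, and stride are made-up values.

```python
def conv1d_output_length(timesteps, kernel_size, strides, padding="valid"):
    """Number of timesteps left after a 1D convolution (no dilation)."""
    if padding == "valid":
        return (timesteps - kernel_size) // strides + 1
    if padding == "same":
        return -(-timesteps // strides)  # ceiling division
    raise ValueError(f"unsupported padding: {padding!r}")


# Hypothetical 2000-step input, kernel of 8, stride of 4
print(conv1d_output_length(2000, kernel_size=8, strides=4))                  # 499
print(conv1d_output_length(2000, kernel_size=8, strides=4, padding="same"))  # 500
```

With a stride of 4, the recurrent layer unrolls over roughly a quarter of the original timesteps, which shortens backpropagation through time accordingly.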
Introspection Code:
Gradients: see this answer
Weights: see this answer
Weights l2 norm:
import numpy as np
# rnn_layer is assumed to be an already-built Keras recurrent layer, e.g. model.layers[0]
rnn_weights = rnn_layer.get_weights()  # returns [kernel, recurrent_kernel, bias], in order
# per-column l2 norms (axis=0), one value per output unit / gate column
kernel_l2norm = np.sqrt(np.sum(np.square(rnn_weights[0]), axis=0, keepdims=True))
recurrent_l2norm = np.sqrt(np.sum(np.square(rnn_weights[1]), axis=0, keepdims=True))
max_kernel_l2norm = np.max(kernel_l2norm)  # `kernel_constraint` will check this
max_recurrent_l2norm = np.max(recurrent_l2norm)  # `recurrent_constraint` will check this
Activations: see this answer
Weights: use .get_weights(), organize to plot in histograms, per-gate. No code yet, but may link a future Q&A of mine.
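In the meantime, one possible way to organize weights per gate is sketched below. It assumes an LSTM, whose Keras kernels concatenate the four gates along the last axis in the order input, forget, cell, output; a GRU would need three blocks instead. The random arrays stand in for rnn_layer.get_weights(), and the helper names are hypothetical.

```python
import numpy as np

units, input_dim = 16, 8  # hypothetical sizes
rng = np.random.default_rng(0)

# Stand-ins for the first two arrays returned by an LSTM layer's get_weights()
kernel = rng.normal(scale=0.1, size=(input_dim, 4 * units))
recurrent_kernel = rng.normal(scale=0.1, size=(units, 4 * units))

GATES = ("input", "forget", "cell", "output")  # assumed Keras LSTM gate order


def split_per_gate(w):
    """Split a (rows, 4 * units) LSTM weight matrix into one block per gate."""
    return dict(zip(GATES, np.split(w, len(GATES), axis=-1)))


def gate_histograms(w, bins=20):
    """Histogram counts and bin edges of the weights, one entry per gate."""
    return {gate: np.histogram(block, bins=bins)
            for gate, block in split_per_gate(w).items()}


for name, w in (("kernel", kernel), ("recurrent_kernel", recurrent_kernel)):
    for gate, (counts, edges) in gate_histograms(w).items():
        print(f"{name:16s} {gate:6s} range=[{edges[0]:+.3f}, {edges[-1]:+.3f}] "
              f"peak bin count={counts.max()}")
```

The counts and edges can be passed to any plotting library; comparing them before and after changing a regularizer shows which gates the change actually affects.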
How does 'learning' work?
The 'ultimate truth' of machine learning that is seldom discussed or emphasized is, we don't have access to the function we're trying to optimize - the test loss function. All of our work is with what are approximations of the true loss surface - both the train set and the validation set. This has some critical implications:
- Train set global optimum can lie very far from test set global optimum
- Local optima are unimportant, and irrelevant:
  - Train set local optimum is almost always a better test set optimum
  - Actual local optima are almost impossible for high-dimensional problems; for the case of the "saddle", you'd need the gradients w.r.t. all of the millions of parameters to equal zero at once
  - Local attractors are a lot more relevant; the analogy then shifts from "falling into a pit" to "gravitating into a strong field"; once in that field, your loss surface topology is bound to that set up by the field, which defines its own local optima; high LR can help exit a field, much like "escape velocity"
Further, loss functions are way too complex to analyze directly; a better approach is to localize analysis to individual layers, their weight matrices, and roles relative to the entire NN. Two key considerations are:
Feature extraction capability. Ex: the driving mechanism of deep classifiers is, given input data, to increase class separability with each layer's transformation. Higher quality features will filter out irrelevant information, and deliver what's essential for the output layer (e.g. softmax) to learn a separating hyperplane.
Information utility. Dead neurons, and extreme activations are major culprits of poor information utility; no single neuron should dominate information transfer, and too many neurons shouldn't lie purposeless. Stable activations and weight distributions enable gradient propagation and continued learning.
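Two rough indicators of poor information utility, dead units and saturated activations, can be computed directly from a layer's outputs. The sketch below uses synthetic activations in place of real ones; the function names, tolerance, saturation threshold, and array shape are all assumptions chosen for illustration.

```python
import numpy as np


def dead_fraction(acts, tol=1e-6):
    """Fraction of units whose activation is ~0 for every sample and timestep."""
    flat = acts.reshape(-1, acts.shape[-1])
    return float(np.mean(np.all(np.abs(flat) < tol, axis=0)))


def saturated_fraction(acts, threshold=0.99):
    """Fraction of individual activations with magnitude above the threshold
    (meaningful for bounded activations such as tanh)."""
    return float(np.mean(np.abs(acts) > threshold))


rng = np.random.default_rng(0)
# Hypothetical (samples, timesteps, units) outputs of a tanh RNN layer
acts = np.tanh(rng.normal(scale=3.0, size=(32, 10, 8)))
acts[..., :2] = 0.0  # force two of the eight units to be "dead"

print("dead units:", dead_fraction(acts))  # 0.25
print("saturated activations:", saturated_fraction(acts))
```

Tracking both numbers across training runs gives a quick check on whether a regularizer is keeping activations in a range where gradients can still propagate.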
How does regularization work? (Read the above first.)
In a nutshell, via maximizing the NN's information utility, and improving estimates of the test loss function. Each regularization method is unique, and no two are exactly alike - see "RNN regularization methods".
RNN: Depth vs. Width: not as simple as "one is more nonlinear, other works in higher dimensions".
- RNN width is defined by (1) # of input channels; (2) # of cell's filters (output channels). As with CNN, each RNN filter is an independent feature extractor: more is suited for higher-complexity information, including but not limited to: dimensionality, modality, noise, frequency.
- RNN depth is defined by (1) # of stacked layers; (2) # of timesteps. Specifics will vary by architecture, but from information standpoint, unlike CNNs, RNNs are dense: every timestep influences the ultimate output of a layer, hence the ultimate output of the next layer - so it again isn't as simple as "more nonlinearity"; stacked RNNs exploit both spatial and temporal information.
Source: https://stackoverflow.com/questions/48714407/rnn-regularization-which-component-to-regularize