I want to predict the next frame of a (greyscale) video given N
previous frames - using CNNs or RNNs in Keras. Most tutorials and other information regarding ti
After doing lots of research, I finally stumbled upon the Keras example for the ConvLSTM2D
layer (already mentioned by Marcin Możejko), which does exactly what I need.
In the current version of Keras (v1.2.2), this layer is already included and can be imported using
from keras.layers.convolutional_recurrent import ConvLSTM2D
To use this layer, the video data has to be formatted as follows:
[nb_samples, nb_frames, width, height, channels] # if using dim_ordering = 'tf'
[nb_samples, nb_frames, channels, width, height] # if using dim_ordering = 'th'
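As a rough illustration of the layout above, here is a minimal NumPy sketch that cuts one greyscale video into overlapping windows of N frames plus the frame that follows each window. The helper name make_sequences, the 40x40 frame size and the random placeholder data are my own assumptions, not part of the Keras example.

```python
import numpy as np

def make_sequences(video, nb_frames):
    """Split a greyscale video of shape (total_frames, width, height) into
    overlapping input windows and their next-frame targets."""
    total_frames = video.shape[0]
    nb_samples = total_frames - nb_frames
    if nb_samples <= 0:
        raise ValueError("video is too short for the requested window length")
    # Stack nb_samples windows of nb_frames consecutive frames each.
    x = np.stack([video[i:i + nb_frames] for i in range(nb_samples)])
    # The target of window i is the frame right after it, video[i + nb_frames].
    y = video[nb_frames:]
    # Append a channel axis of size 1 (greyscale, 'tf' ordering).
    return x[..., np.newaxis], y[..., np.newaxis]

video = np.random.rand(20, 40, 40).astype("float32")  # placeholder data
x, y = make_sequences(video, nb_frames=5)
print(x.shape)  # (15, 5, 40, 40, 1) -> [nb_samples, nb_frames, width, height, channels]
print(y.shape)  # (15, 40, 40, 1)

# For 'th' ordering, move the channel axis in front of width and height.
x_th = np.moveaxis(x, -1, 2)
print(x_th.shape)  # (15, 5, 1, 40, 40)
```

The windows overlap heavily, so for long videos it may be worth generating them lazily instead of materialising the whole array.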
So basically every approach has its advantages and disadvantages. Let's go through the ones you provided and then some others to find the best approach:
LSTM: Among their biggest advantages is the ability to learn long-term dependency patterns in your data. They were designed to analyse long sequences such as speech or text. This may also cause problems, because the number of parameters can be really high. Other typical recurrent architectures like GRU might overcome this issue. The main disadvantage is that in their standard (sequential) implementation it is infeasible to fit them on video data, for the same reason that dense layers are bad for image data: loads of temporal and spatial invariances must be learnt by a topology which is completely unsuited to capturing them efficiently. Shifting a video by one pixel to the right might completely change the output of your network.
Another thing worth mentioning is that training an LSTM is believed to be similar to finding an equilibrium between two rival processes: finding good weights for the dense-like output computations and finding good inner-memory dynamics for processing sequences. Finding this equilibrium might take a really long time, but once it is found it is usually quite stable and produces really good results.
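To put a number on the parameter concern, here is a small back-of-the-envelope sketch. It assumes the textbook LSTM and GRU formulations (four and three gates respectively, each with input weights, recurrent weights and one bias vector) applied to flattened frames. The function names, the frame sizes and the 256 units are arbitrary choices for illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * units * (input_dim + units + 1)

def gru_params(input_dim, units):
    # Three gates in the classic formulation, same structure per gate.
    return 3 * units * (input_dim + units + 1)

units = 256
for width, height in [(40, 40), (224, 224)]:
    input_dim = width * height  # one flattened greyscale frame per time step
    print(width, height, lstm_params(input_dim, units), gru_params(input_dim, units))
```

For 40x40 frames this gives about 1.9 million LSTM weights and about 1.4 million for the GRU; at 224x224 the LSTM already needs more than 51 million, and none of those weights are shared across pixel positions.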
Conv3D: Among their biggest advantages is the ability to capture spatial and temporal invariances in the same manner as Conv2D does for images. This makes the curse of dimensionality much less harmful. On the other hand, just as Conv1D might not produce good results with longer sequences, the lack of any memory might make learning a long sequence harder.
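One way to see the "no memory" limitation is to count how many frames a single output position can see. The sketch below assumes stride-1, undilated convolutions with no temporal pooling; both function names are made up for this illustration.

```python
def temporal_receptive_field(kernel_sizes):
    """Number of input frames visible to one output position after stacking
    stride-1, undilated convolutions with the given temporal kernel sizes."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each layer widens the view by (kernel - 1) frames
    return field

def layers_needed(sequence_length, kernel_size=3):
    """Smallest number of stacked layers whose receptive field covers the sequence."""
    layers = 0
    while temporal_receptive_field([kernel_size] * layers) < sequence_length:
        layers += 1
    return layers

print(temporal_receptive_field([3, 3]))  # two 3-frame kernels see 5 frames
print(layers_needed(100))                # 50 layers to cover 100 frames
```

Pooling or strides along the time axis grow the receptive field faster, at the cost of temporal resolution.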
Of course one may use different approaches like:
TimeDistributed + Conv2D: using a TimeDistributed wrapper, one may apply a pretrained convnet such as Inception framewise and then analyse the feature maps sequentially. A really huge advantage of this approach is the possibility of transfer learning. As a disadvantage, one may think of it as a Conv2.5D: it lacks temporal analysis of your data.
ConvLSTM: this architecture is not yet supported by the newest version of Keras (as of March 6th 2017), but as one may see here it should be provided in the future. It is a mixture of LSTM and Conv2D, and it is believed to be better than stacking Conv2D and LSTM.
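To make the TimeDistributed + Conv2D idea concrete, here is a NumPy-only sketch of what a framewise wrapper does conceptually: fold the time axis into the batch axis, run a per-frame function, and unfold again. This is not Keras code. The time_distributed helper and the mean_pool stand-in for a pretrained convnet are assumptions made for illustration.

```python
import numpy as np

def time_distributed(frame_fn, clips):
    """Apply frame_fn independently to every frame of every clip.

    clips has shape (samples, frames, width, height, channels); frame_fn maps
    a batch of frames (n, width, height, channels) to features (n, ...).
    """
    samples, frames = clips.shape[:2]
    # Fold the time axis into the batch axis so frame_fn only sees single frames.
    flat = clips.reshape((samples * frames,) + clips.shape[2:])
    features = frame_fn(flat)
    # Unfold back to (samples, frames, ...).
    return features.reshape((samples, frames) + features.shape[1:])

def mean_pool(frames):
    # Stand-in for a pretrained convnet: global average pooling per channel.
    return frames.mean(axis=(1, 2))

clips = np.random.rand(2, 5, 8, 8, 3)  # placeholder data
features = time_distributed(mean_pool, clips)
print(features.shape)  # (2, 5, 3)

# Reversing the frame order merely reverses the per-frame features: the
# framewise stage itself carries no temporal information.
reversed_features = time_distributed(mean_pool, clips[:, ::-1])
print(np.allclose(reversed_features, features[:, ::-1]))  # True
```

Any temporal modelling therefore has to come from whatever consumes the per-frame features afterwards.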
Of course these are not the only ways to solve this problem. I'll mention one more which might be useful:
TimeDistributed(ResNet), whose output is then fed to a Conv3D with multiple and aggressive spatial pooling, and finally transformed by a GRU/LSTM layer.
PS:
One more thing worth mentioning is that the shape of video data is actually 4D, with (frames, width, height, channels).
PS2:
In case your data is actually 3D, with (frames, width, height), you could use a classic Conv2D (by treating frames as channels) to analyse this data, which might actually be more computationally efficient. In the case of transfer learning you should add an additional dimension, because most CNN models were trained on data with shape (width, height, 3). One may notice that your data doesn't have 3 channels. In this case the technique usually used is to repeat the spatial matrix three times.
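Both reshaping tricks from PS2 can be sketched in a few lines of NumPy. The clip size and the variable names are placeholders chosen for this example.

```python
import numpy as np

clip = np.random.rand(10, 64, 48)  # (frames, width, height), placeholder data

# Option 1: treat frames as channels so a plain Conv2D can consume the clip.
frames_as_channels = np.moveaxis(clip, 0, -1)
print(frames_as_channels.shape)  # (64, 48, 10) -> (width, height, frames)

# Option 2: repeat every greyscale frame three times along a new channel axis
# so it matches the (width, height, 3) input of a pretrained CNN.
rgb_like = np.repeat(clip[..., np.newaxis], 3, axis=-1)
print(rgb_like.shape)  # (10, 64, 48, 3) -> (frames, width, height, 3)
```

Option 2 triples the memory footprint without adding information, so it is mainly useful when you want to reuse pretrained weights.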
PS3:
An example of this 2.5D approach is:
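from keras.applications.inception_v3 import InceptionV3
from keras.layers import Input, TimeDistributed, Conv3D, Flatten, Dense

# input_shape is (frames, width, height, channels); `..` marks arguments elided in the original
inputs = Input(shape=input_shape)
base_cnn_model = InceptionV3(include_top=False, ..)
# run the pretrained convnet on every frame independently
temporal_analysis = TimeDistributed(base_cnn_model)(inputs)
# Keras 2 signature: kernel size as a tuple (Keras 1 took three separate kernel dimensions)
conv3d_analysis = Conv3D(nb_of_filters, (3, 3, 3))(temporal_analysis)
conv3d_analysis = Conv3D(nb_of_filters, (3, 3, 3))(conv3d_analysis)
output = Flatten()(conv3d_analysis)
output = Dense(nb_of_classes, activation="softmax")(output)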