I used a Transformer model for image captioning. The model takes an image as input and passes it through some convolution layers; I then convert the result into a single sequence, and this seq