Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
TLDR; The authors use an attention mechanism for image caption generation, allowing the decoder RNN to focus on specific parts of the image. To find the correspondence between words and image regions, the attention operates over features from a lower convolutional layer (before pooling) rather than the final fully connected representation. The authors propose both a “hard” attention mechanism (trained with sampling methods) and a “soft” attention mechanism (trained end-to-end), and show qualitatively that the decoder focuses on sensible regions while generating text, adding a layer of interpretability to the model. The attention-based models achieve state-of-the-art results on Flickr8k, Flickr30k, and MS COCO.
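To make the soft-attention computation concrete, here is a minimal PyTorch sketch (not the authors' code): annotation vectors come from a lower convolutional feature map, a small MLP scores each spatial location against the decoder's previous hidden state, and the context vector is the softmax-weighted sum of annotations. The tensor shapes, dimensions, and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft (deterministic) attention over conv feature-map locations.

    Assumed shapes for illustration:
      features: (batch, L, D)  - L spatial locations from a lower conv layer
      hidden:   (batch, H)     - previous decoder LSTM hidden state
    """
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # Score every location i against the previous hidden state h_{t-1}
        e = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                          # (batch, L)
        alpha = F.softmax(e, dim=1)                             # weights sum to 1
        # Context vector = expected annotation vector under alpha
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)   # (batch, D)
        return context, alpha
```

At each decoding step, `context` would be fed into the LSTM together with the current word embedding, and `alpha` is what gets visualized. Hard attention would instead sample a single location from `alpha` and train with sampling-based gradient estimators rather than end-to-end backprop.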
Key Points
- To find the correspondence between words and image regions, the model attends over feature maps from a lower convolutional layer (before pooling) instead of the final fully connected features.
- Two attention mechanisms: soft and hard. Depending on the evaluation metric (BLEU vs. METEOR), one or the other performs better.
- Training on the largest dataset (MS COCO) takes about 3 days on a Titan Black GPU. The encoder is the Oxford VGG network.
- The soft attention mechanism is the same as the one used in seq2seq models.
- Attention weights are visualized by upsampling them to the image size and applying a Gaussian filter (see the sketch below).
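A rough sketch of that visualization step, under assumed values (14x14 grid of 196 locations, 224x224 input, hypothetical sigma), not the authors' code: the attention weights for one word form a small spatial grid, which is upsampled to the image resolution and smoothed with a Gaussian before being overlaid on the image.

```python
import numpy as np
from scipy.ndimage import zoom, gaussian_filter

def attention_heatmap(alpha, image_size=224, grid=14, sigma=8):
    """Turn one word's attention weights into an image-sized heatmap.

    alpha: flat array of length grid*grid (e.g. 196 conv locations).
    Grid size, image size, and sigma are assumed values for illustration.
    """
    attn = np.asarray(alpha).reshape(grid, grid)
    # Upsample the coarse attention map to the input image resolution
    upsampled = zoom(attn, image_size / grid, order=1)
    # Smooth with a Gaussian so the overlay looks continuous
    heatmap = gaussian_filter(upsampled, sigma=sigma)
    return heatmap / (heatmap.max() + 1e-8)  # normalize to [0, 1] for overlay
```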
Notes/Questions
- Would’ve liked to see an explanation of when and why soft vs. hard attention performs better.
- What is the computational overhead of using the attention mechanism? Is it significant?
Source: CSDN
Author: DrogoZhang
Link: https://blog.csdn.net/weixin_40400177/article/details/103605623