A Multimodal Framework for Video Caption Generation
By: Reshmi S. Bhooshan, Suresh K.
Format: Article
Published: IEEE, 2022-01-01
Description
Video captioning is a highly challenging computer vision task: automatically describing video clips in natural-language sentences that reflect their embedded semantics. In this work, a video caption generation framework is proposed, consisting of a discrete wavelet convolutional neural architecture together with multimodal feature attention. Global, contextual, and temporal features of the video frames are taken into account, and separate attention networks are integrated into the visual attention predictor network to capture multiple attentions from these features. The attended features, combined with textual attention, are fed to a visual-to-text translator for caption generation. Experiments are conducted on two benchmark video captioning datasets, MSVD and MSR-VTT. The results show improved performance, with CIDEr scores of 91.7 and 52.2 on the respective datasets.
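The fusion of separate attention streams described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature streams are random stand-ins for the global, contextual, and temporal features, the dot-product scoring is an assumption, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(features, query):
    # features: (T, d) per-frame features; query: (d,) decoder state.
    # Dot-product scoring is an illustrative choice, not the paper's.
    scores = features @ query             # (T,) relevance of each frame
    weights = softmax(scores)             # attention distribution over frames
    return weights @ features             # (d,) attended feature summary

rng = np.random.default_rng(0)
T, d = 8, 16  # hypothetical frame count and feature dimension
# Stand-ins for the three feature streams named in the abstract
streams = {name: rng.standard_normal((T, d))
           for name in ("global", "contextual", "temporal")}
query = rng.standard_normal(d)

# One attention pass per modality; fuse by concatenation
fused = np.concatenate([attend(f, query) for f in streams.values()])
print(fused.shape)  # (48,)
```

Each modality gets its own attention distribution, so a frame can be salient temporally but not globally; the concatenated vector would then feed the visual-to-text translator.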