Review of Visual Question Answering Technology

By: WANG Yu, SUN Haichun

Format: Article
Published: Journal of Computer Engineering and Applications Beijing Co., Ltd., Science Press 2023-07-01

Description

Visual question answering (VQA) is a popular cross-modal task that combines natural language processing and computer vision techniques. Its main objective is to enable computers to intelligently recognize and retrieve visual content and to provide accurate answers to questions about it. VQA integrates multiple technologies, such as object recognition and detection, intelligent question answering, image attribute classification, and scene analysis. It can support a wide range of cutting-edge interactive AI tasks, such as visual dialogue and visual navigation, and has broad application prospects and great value. Over the past few years, advances in computer vision, natural language processing, and cross-modal AI models have provided many new technologies and methods for visual question answering. This paper summarizes the mainstream models and specialized datasets in the field of visual question answering from 2019 to 2022. First, following the module framework, it reviews and discusses the mainstream technical methods used in the key steps of the visual question answering task. Next, it categorizes the models in this field according to the technical methods they adopt, and briefly introduces the focus of their improvements and their limitations. Then, it summarizes the commonly used datasets and evaluation metrics for visual question answering, and compares and discusses the performance of several typical models. Finally, it highlights the key issues that still need to be addressed in the field, and offers an outlook on future applications and technological development.