A comparative study of language Transformers for video question answering

Abstract

With the goal of correctly answering questions about images or videos, visual question answering (VQA) has developed rapidly in recent years. However, current VQA systems mainly focus on answering questions about a single image and face many challenges when answering video-based questions. Video question answering not only has to understand the evolution across video frames but also requires a certain understanding of the corresponding subtitles. In this paper, we propose a language Transformer-based video question answering model to encode the complex semantics of video clips. Unlike previous models, which represent visual features with recurrent neural networks, our model encodes visual concept sequences with a pre-trained language Transformer. We investigate the performance of our model using four language Transformers over two different datasets. The results demonstrate substantial improvements over previous work.
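To make the core idea concrete, below is a minimal sketch of encoding a visual concept sequence together with subtitles and a question using a pre-trained language Transformer. It assumes the HuggingFace transformers library with BERT as the language Transformer; the concept labels, subtitle text, and variable names are illustrative and do not reproduce the paper's exact pipeline.

```python
# Minimal sketch: encoding visual concepts and subtitles with a
# pre-trained language Transformer (BERT via HuggingFace transformers).
# The concept labels and subtitle text below are illustrative only.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Visual concepts detected in the video frames, flattened into a word sequence
visual_concepts = ["person", "sofa", "cup", "talking", "living room"]
subtitle = "I told you she would be here by nine."
question = "Where are they talking?"

# Feed the question as one segment and the subtitles plus visual concepts as
# the other; BERT's self-attention then mixes information across all sources.
text_a = question
text_b = subtitle + " " + " ".join(visual_concepts)
inputs = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] embedding as a joint representation of clip and question;
# a small classifier head on top would score each candidate answer.
clip_question_embedding = outputs.last_hidden_state[:, 0]
print(clip_question_embedding.shape)  # torch.Size([1, 768])
```

In this sketch, replacing a recurrent encoder with the Transformer's self-attention lets every visual concept attend directly to the subtitles and the question, which is the contrast with RNN-based models that the abstract draws.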

Publication
Neurocomputing
Zekun Yang
PhD Student
Noa Garcia
Specially-Appointed Assistant Professor

Her research interests lie in computer vision and machine learning applied to visual retrieval and joint models of vision and language for high-level understanding tasks.

Chenhui Chu
Guest Associate Professor
Yuta Nakashima
Professor

Yuta Nakashima is a professor at the Institute for Datability Science, Osaka University. His research interests include computer vision, pattern recognition, natural language processing, and their applications.