BERT representations for video question answering

Abstract

Visual question answering (VQA) aims at answering questions about the visual content of an image or a video. Currently, most work on VQA focuses on image-based question answering, and less attention has been paid to answering questions about videos. However, VQA on videos presents some unique challenges worth studying: it not only requires modelling a sequence of visual features over time, but it often also needs to reason about associated subtitles. In this work, we propose to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics of video clips. Our proposed model jointly captures the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer. In our experiments, we exhaustively study the performance of our model with different input arrangements, showing substantial improvements over previous work on two well-known video VQA datasets: TVQA and Pororo.
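
The sketch below illustrates the general idea described in the abstract: flattening detected visual concepts into words and feeding them, together with the subtitles, the question, and a candidate answer, into a pretrained BERT. This is not the authors' code; the model names, the answer-scoring head, and the example inputs are illustrative assumptions based on the HuggingFace transformers API.

```python
# Minimal sketch (assumptions, not the paper's implementation): encode
# question + candidate answer as segment A and subtitles + visual concepts
# as segment B with a pretrained BERT, then score the candidate from [CLS].
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(bert.config.hidden_size, 1)  # hypothetical answer scorer

def score_candidate(question, answer, subtitles, visual_concepts):
    # Visual concepts (e.g., detected objects or actions) are flattened into
    # words so the language-based Transformer can consume them with the text.
    context = subtitles + " " + " ".join(visual_concepts)
    inputs = tokenizer(
        question + " " + answer,  # segment A: question + candidate answer
        context,                  # segment B: subtitles + visual concepts
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    outputs = bert(**inputs)
    cls = outputs.last_hidden_state[:, 0]  # [CLS] token representation
    return score_head(cls).squeeze(-1)     # higher score = more plausible answer

# Usage: in a multiple-choice setting, pick the highest-scoring candidate.
candidates = ["A cup", "A book", "A phone", "A laptop"]
scores = [score_candidate("What is Leonard holding?", a,
                          "Leonard: I brought you some coffee.",
                          ["person", "cup", "table"])
          for a in candidates]
```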

Publication
Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV 2020)
Zekun Yang
PhD Student
Noa Garcia
Specially-Appointed Assistant Professor

Her research interests lie in computer vision and machine learning applied to visual retrieval and joint models of vision and language for high-level understanding tasks.

Chenhui Chu
Guest Associate Professor
Yuta Nakashima
Professor

Yuta Nakashima is a professor at the Institute for Datability Science, Osaka University. His research interests include computer vision, pattern recognition, natural language processing, and their applications.