BERT representations for video question answering

Abstract

Visual question answering (VQA) aims at answering questions about the visual content of an image or a video. Currently, most work on VQA focuses on image-based question answering, and less attention has been paid to answering questions about videos. However, VQA on video presents unique challenges that are worth studying: it requires not only modelling a sequence of visual features over time, but often also reasoning about the associated subtitles. In this work, we propose to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics of video clips. Our proposed model jointly captures the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer. In our experiments, we exhaustively study the performance of our model under different input arrangements, showing clear improvements over previous work on two well-known video VQA datasets: TVQA and Pororo.
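As an illustration of the encoding described above, the sketch below shows one way subtitles and frame-level visual concept labels could be packed into a pretrained BERT as a sentence pair and scored with a simple head. The model name, example inputs, and the scoring layer are assumptions for illustration only, not the authors' exact implementation.

```python
# Minimal sketch (not the paper's exact model): jointly encode subtitles and
# visual concept labels from a video clip with a pretrained BERT.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical inputs: clip subtitles and visual concepts detected in the frames.
subtitles = "I think we should check the kitchen first."
visual_concepts = "person refrigerator table cup"          # detected labels joined as text
question_and_answer = "Where are they going? the kitchen"  # question + one candidate answer

# Segment A: question + candidate answer; segment B: video context (subtitles + concepts),
# using BERT's standard sentence-pair encoding.
encoded = tokenizer(
    question_and_answer,
    subtitles + " " + visual_concepts,
    return_tensors="pt",
    truncation=True,
    max_length=256,
)

with torch.no_grad():
    outputs = bert(**encoded)

# The [CLS] embedding summarizes the pair; scoring each candidate answer this way
# and applying a softmax over candidates yields a multiple-choice video QA model.
cls_embedding = outputs.last_hidden_state[:, 0]   # shape: (1, 768)
score = torch.nn.Linear(768, 1)(cls_embedding)    # untrained scoring head, for illustration
print(score.item())
```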

Publication Type
Published In
Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020
Zekun Yang
Ph.D. Student
Noa Garcia
Associate Professor (Concurrent)

Her research interests lie in computer vision and machine learning applied to visual retrieval and joint models of vision and language for high-level understanding tasks.

Chenhui Chu
Guest Associate Professor
Yuta Nakashima
Professor

His research covers computer vision and pattern recognition, focusing on image and video recognition and understanding with deep neural networks, along with applied work that incorporates natural language processing.