Publications

Uncurated image-text datasets: Shedding light on demographic bias

The increasing tendency to collect large and uncurated datasets to train vision-and-language models has raised concerns about fair …

Toward verifiable and reproducible human evaluation for text-to-image generation

Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process …

Not only generative art: Stable diffusion for content-style disentanglement in art analysis

The duality of content and style is inherent to the nature of art. For humans, these two elements are clearly different: content refers …

Multi-modal humor segment prediction in video

Humor can be induced by various signals in the visual, linguistic, and vocal modalities emitted by humans. Finding humor in videos is …

Model-agnostic gender debiased image captioning

Image captioning models are known to perpetuate and amplify harmful societal bias in the training set. In this work, we aim to mitigate …

Learning bottleneck concepts in image classification

Interpreting and explaining the behavior of deep neural networks is critical for many tasks. Explainable AI provides a way to address …

ICDAR’23: Intelligent Cross-Data Analysis and Retrieval

Recently, there has been an increased interest in cross-data research problems, such as predicting air quality using life logging …

Real-time estimation of the remaining surgery duration for cataract surgery using deep convolutional neural networks and long short-term memory

Estimating the surgery length has the potential to be utilized as skill assessment, surgical training, or efficient surgical facility …

Inverse Rendering of Translucent Objects using Physical and Neural Renderers

In this work, we propose an inverse rendering model that estimates 3D shape, spatially-varying reflectance, homogeneous subsurface …

Human-Imperceptible Identification With Learnable Lensless Imaging

Lensless imaging protects visual privacy by capturing heavily blurred images that are imperceptible for humans to recognize the subject …

Development of a vertex finding algorithm using Recurrent Neural Network

Deep learning is a rapidly-evolving technology with the possibility to significantly improve the physics reach of collider experiments. …

Cross-language font style transfer

In this paper, we propose a cross-language font style transfer system that can synthesize a new font by observing only a few samples …

Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

Video summarization aims to select a most informative subset of frames in a video to facilitate efficient video browsing. Unsupervised …

Automated grading system of retinal arterio-venous crossing patterns: A deep learning approach replicating ophthalmologist’s diagnostic process of arteriolosclerosis

The morphological feature of retinal arterio-venous crossing patterns is a valuable source of cardiovascular risk stratification as it …

Gender and Racial Bias in Visual Question Answering Datasets

Acquiring a Dynamic Light Field Through a Single-Shot Coded Image

We propose a method for compressively acquiring a dynamic light field (a 5-D volume) through a single-shot coded image (a 2-D …

Information Extraction from Public Meeting Articles

Public meeting articles are the key to understanding the history of public opinion and public sphere in Australia. Information …

Tone Classification for Political Advertising Video using Multimodal Cues

Politics has always gotten much attention throughout history, and video advertisement has become one of the most essential tools for …

Multi-label disengagement and behavior prediction in online learning

Student disengagement prediction in online learning environments is beneficial in various ways, especially to help provide timely cues …

Match them up: visually explainable few-shot image classification

Few-shot learning (FSL) approaches, mostly neural network-based, assume that pre-trained knowledge can be obtained from base (seen) …

ICDAR’22: Intelligent Cross-Data Analysis and Retrieval

We have witnessed the rise of cross-data against multimodal data problems recently. The cross-modal retrieval system uses a textual …

Emotional Intensity Estimation based on Writer’s Personality

We propose a method for personalized emotional intensity estimation based on a writer’s personality test for Japanese SNS posts. …

Deep Gesture Generation for Social Robots Using Type-Specific Libraries

Body language such as conversational gesture is a powerful way to ease communication. Conversational gestures do not only make a speech …

Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction

SCOUTER: Slot attention-based classifier for explainable image recognition

Explainable artificial intelligence has been gaining attention in the past few years. However, most existing methods are based on …

Image Retrieval by Hierarchy-aware Deep Hashing Based on Multi-task Learning

Deep hashing has been widely used to approximate nearest-neighbor search for image retrieval tasks. Most of them are trained with …

Explain me the painting: Multi-topic knowledgeable art description generation

Have you ever looked at a painting and wondered what is the story behind it? This work presents a framework to bring art closer to …

Built year prediction from Buddha face with heterogeneous labels

Buddha statues are a part of human culture, especially of the Asia area, and they have been alongside human civilisation for more than …

A comparative study of language Transformers for video question answering

With the goal of correctly answering questions about images or videos, visual question answering (VQA) has quickly developed in recent …

WRIME: A new dataset for emotional intensity estimation with subjective and objective annotations

We annotate 17,000 SNS posts with both the writer’s subjective emotional intensity and the reader’s objective one to construct a …

MTUNet: Few-shot image classification with visual explanations

Few-shot learning (FSL) approaches, mostly neural network-based, assume that the pre-trained knowledge can be obtained from base …

Noisy-LSTM: Improving temporal awareness for video semantic segmentation

Semantic video segmentation is a key challenge for various applications. This paper presents a new model named Noisy-LSTM, which is …

Preventing fake information generation against media clone attacks

Fake media has been spreading due to remarkable advances in media processing and machine learning technologies, causing serious problems …

Generation and detection of media clones

With the spread of high-performance sensors and social network services (SNS) and the remarkable advances in machine learning …

Cross-lingual visual grounding

Visual grounding is a vision and language understanding task aiming at locating a region in an image according to a specific query …

Improving topic modeling through homophily for legal documents

Topic modeling that can automatically assign topics to legal documents is very important in the domain of computational law. The …

Diagnostic performance for pulmonary adenocarcinoma on CT: comparison of radiologists with and without three-dimensional convolutional neural network

Objectives: To compare diagnostic performance for pulmonary invasive adenocarcinoma among radiologists with and without …

Red-Fluorescent Pt Nanoclusters for Detecting and Imaging HER2 in Breast Cancer Cells

Overexpression of human epidermal growth factor receptor 2 (HER2) is associated with more frequent cancer recurrence and metastasis. …

Knowledge-based video question answering with unsupervised scene descriptions

To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall …

Acquiring dynamic light fields through coded aperture camera

We investigate the problem of compressive acquisition of a dynamic light field. A promising solution for compressive light field …

Information extraction from public meeting articles

KnowIT VQA: Answering knowledge-based questions about videos

We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a …

Warmer Environments Increase Implicit Mental Workload Even If Learning Efficiency Is Enhanced

© 2020 Kimura, Takemura, Nakashima, Kobori, Nagahara, Numao and Shinohara. Climate change is one of the most important …

Speech-driven face reenactment for a video sequence

We present a system for reenacting a person’s face driven by speech. Given a video sequence with the corresponding audio track of …

Joint learning of vessel segmentation and artery/vein classification with post-processing

Retinal imaging serves as a valuable tool for diagnosis of various diseases. However, reading retinal images is a difficult and …

ContextNet: representation and exploration for painting classification and retrieval in context

© 2019, The Author(s). In automatic art analysis, models that besides the visual elements of an artwork represent the relationships …

BERT representations for video question answering

Visual question answering (VQA) aims at answering questions about the visual content of an image or a video. Currently, most work on …

3D Image Reconstruction from Multi-focus Microscopic Images

This paper presents a method for reconstructing a 3D image from multi-focus microscopic images captured at different focus settings. We model …

Use of big data in historical research: Focusing on Australia

Reflectance and Shape Estimation with a Light Field Camera Under Natural Illumination

Reflectance and shape are two important components in visually perceiving the real world. Inferring the reflectance and shape of an …

Deep-UV excitation fluorescence microscopy for detection of lymph node metastasis using deep neural network

Contextualized multi-sense word embedding

Currently, distributed word representations are employed in many natural language processing tasks. However, when generating one …

Human shape reconstruction with loose clothes from partially observed data by pose specific deformation

Reconstructing the entire body of a moving human in a computer is important for various applications, such as tele-presence, virtual …

Deep compressive sensing for visual privacy protection in flatcam imaging

Detection followed by projection in conventional privacy cameras is vulnerable to software attacks that threaten to expose image sensor …

Corpus construction from historical newspaper data

Fall detection using optical level anonymous image sensing system

Falls are one of the leading causes of injury among elderly individuals. Systems that automatically detect falls can significantly …

Video meets knowledge in visual question answering

In this work, we address knowledge-based visual question answering in videos. First, we introduce KnowIT VQA, a video dataset with …

Negative lexically constrained decoding for paraphrase generation

Paraphrase generation can be regarded as monolingual translation. Unlike bilingual machine translation, paraphrase generation rewrites …

Historical and modern features for Buddha statue classification

© 2019 Copyright held by the owner/author(s). While Buddhism has spread along the Silk Roads, many pieces of art have been displaced. …

High-Speed Imaging Using CMOS Image Sensor With Quasi Pixel-Wise Exposure

Several recent studies on compressive video sensing realized scene capture beyond the fundamental trade-off limit between spatial …

Facial expression recognition with skip-connection to leverage low-level features

Deep convolutional neural networks (CNNs) have established their feet in the ground of computer vision and machine learning, used in …

Efficacy of Novel Multispectral Imaging Device to Determine Anastomosis for Esophagogastrostomy

© 2019 The Authors. Background: Biomedical imaging devices that utilize the optical characteristics of hemoglobin (Hb) have become …

Controllable text simplification with lexical constraint loss

We propose a method to control the level of a sentence in a text simplification task. Text simplification is a monolingual translation …

A Coded Aperture for Watermark Extraction from Defocused Images

© 2019, Springer Nature Switzerland AG. Barcodes and 2D codes are widely used for various purposes, such as electronic payments and …

Visually grounded paraphrase extraction

Learning to capture light fields through a coded aperture camera

We propose a learning-based framework for acquiring a light field through a coded aperture camera. Acquiring a light field is a …

Graphical classification of DNA sequences of HLA alleles by deep learning

© 2018 The Author(s). Alleles of human leukocyte antigen (HLA)-A DNAs are classified and expressed graphically by using artificial …

Coherent anti-Stokes Raman scattering rigid endoscope toward robot-assisted surgery

© 2018 Optical Society of America. Label-free visualization of nerves and nervous plexuses will improve the preservation of …

Adapting local features for face detection in thermal image

A thermal camera captures the temperature distribution of a scene as a thermal image. In thermal images, facial appearances of …

Augmented reality marker hiding with texture deformation

Augmented reality (AR) marker hiding is a technique to visually remove AR markers in a real-time video stream. A conventional approach …

Novel view synthesis with light-weight view-dependent texture mapping for a stereoscopic HMD

The proliferation of off-the-shelf head-mounted displays (HMDs) let end-users enjoy virtual reality applications, some of which render …

Video question answering to find a desired video segment

Unsupervised Video Summarization using Deep Video Features

Incremental structural modeling on sparse visual SLAM

© 2017 MVA Organization All Rights Reserved. This paper presents an incremental structural modeling approach that improves the …

Increasing pose comprehension through augmented reality reenactment

Standard video does not capture the 3D aspect of human motion, which is important for comprehension of motion that may be ambiguous. In …

Fine-grained video retrieval for multi-clip video