Revisiting Pixel-Level Contrastive Pre-Training on Scene Images

Zongshang Pang, Yuta Nakashima, Mayu Otani, Hajime Nagahara

January 2024

Abstract

Contrastive image representation learning through instance discrimination has shown impressive transfer performance. Recent strategies have focused on pushing the limit of its transfer performance on dense prediction tasks, particularly when pre-training on scene images with complex structures. Initial approaches employ pixel-level contrastive pre-training to optimize dense spatial features, while subsequent methods use region-mining algorithms to capture holistic regional semantics and to address the issue of semantically inconsistent scene image crops. In this paper, we revisit pixel-level contrastive pre-training on scene images. Contrary to the assumption that pixel-level learning falls short of these two objectives, we demonstrate its under-explored potential: (1) it can learn holistic regional semantics effectively and more simply than region-level methods, and (2) it intrinsically provides tools to mitigate the impact of the semantically inconsistent views that arise from scene-level training images. We propose PixCon, a pixel-level contrastive learning framework, and explore two variants with different positive matching strategies to investigate the potential of pixel-level learning. Additionally, when PixCon incorporates a novel semantic reweighting approach tailored for scene image pre-training, it outperforms or matches previous region-level methods in object detection and semantic segmentation across multiple benchmarks.

Type

Conference paper

Publication

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

Zongshang Pang

PhD Student

Yuta Nakashima

Professor

Hajime Nagahara

Professor