Revisiting Pixel-Level Contrastive Pre-Training on Scene Images

Abstract

Contrastive image representation learning through instance discrimination has shown impressive transfer performance. Recent strategies have focused on pushing the limit of its transfer performance for dense prediction tasks, particularly when pre-training is conducted on scene images with complex structures. Initial approaches employ pixel-level contrastive pre-training to optimize dense spatial features, while subsequent methods use region-mining algorithms to capture holistic regional semantics and to address the issue of semantically inconsistent scene image crops. In this paper, we revisit pixel-level contrastive pre-training on scene images. Contrary to the assumption that pixel-level learning falls short of these objectives, we demonstrate its under-explored potential: (1) it can learn holistic regional semantics effectively, and more simply than region-level methods, and (2) it intrinsically provides tools to mitigate the impact of the semantically inconsistent views that arise with scene-level training images. We propose PixCon, a pixel-level contrastive learning framework, and explore two variants with different positive matching strategies to investigate the potential of pixel-level learning. Additionally, when PixCon incorporates a novel semantic reweighting approach tailored for scene image pre-training, it outperforms or matches previous region-level methods in object detection and semantic segmentation across multiple benchmarks.

Paper type
Published in
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Zongshang Pang
Doctoral student
Yuta Nakashima (中島悠太)
Professor

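Research in computer vision and pattern recognition. Works primarily on image and video recognition and understanding using deep neural networks and related techniques, as well as applied research that draws on natural language processing.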

Hajime Nagahara (長原一)
Professor

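Specializes in computational photography and computer vision, conducting research on real-world sensing, information processing, and image recognition technologies. Beyond image sensing, he aims to develop computational sensing methods that extend to a wide range of sensors, and to shift toward sparse sensing, which measures meaningful information from high-dimensional, redundant real-world big data.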