Self-Supervised Monocular Depth and Motion Learning in Dynamic Scenes: Semantic Prior to Rescue

Research output: Contribution to journal, Article, peer-review

11 Scopus citations

Abstract

We introduce an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without geometric supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we propose two types of residual motion learning frameworks to explicitly disentangle camera and object motions in dynamic driving scenes with different levels of semantic prior knowledge: video instance segmentation as a strong prior, and object detection as a weak prior. Third, we design a unified photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. In addition, we present an unsupervised method of 3D motion field regularization for semantically plausible object motion representation. Our proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on KITTI, Cityscapes, and the Waymo Open Dataset, our framework is shown to outperform state-of-the-art depth and motion estimation methods. Our code, dataset, and models are publicly available.

Original language: English
Pages (from-to): 2265-2285
Number of pages: 21
Journal: International Journal of Computer Vision
Volume: 130
Issue number: 9
State: Published - Sep 2022

Bibliographical note

Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Keywords

  • 3D visual perception
  • Monocular depth estimation
  • Motion estimation
  • Self-supervised learning
