Abstract
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward-projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals on every background and object region. Third, we introduce a general-purpose auto-annotation scheme that uses any off-the-shelf instance segmentation and optical flow models to produce the video instance segmentation maps used as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments on the KITTI and Cityscapes datasets, our framework is shown to outperform state-of-the-art depth and motion estimation methods. Our code, dataset, and models are publicly available.
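Since the abstract contrasts inverse and forward projection, a minimal sketch of the conventional inverse-warping photometric loss may help fix ideas: target-frame pixels are back-projected with the predicted depth, moved by a rigid transform, projected into the source frame, and the source image is sampled there. This is a generic PyTorch illustration under assumed tensor shapes, not the authors' released code; the paper's point is that reusing this sampling scheme for per-object motion is geometrically incorrect, which their neural forward-projection module (not reproduced here) addresses.

```python
# Minimal sketch of the standard inverse-warping photometric loss used by
# self-supervised monocular depth methods. Shapes are assumptions:
# images (B, 3, H, W), depth (B, 1, H, W), pose T (B, 4, 4), intrinsics K (B, 3, 3).
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift each pixel to a 3-D point: X = D(p) * K^{-1} p (homogeneous p)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=depth.device),
        torch.arange(w, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()  # (3, H, W)
    pix = pix.view(3, -1).unsqueeze(0).expand(b, -1, -1)             # (B, 3, H*W)
    return depth.view(b, 1, -1) * (K_inv @ pix)                      # (B, 3, H*W)

def inverse_warp(src_img, tgt_depth, T, K, K_inv):
    """Synthesize the target view by sampling the source image at the
    projections of the target's 3-D points (inverse warping)."""
    b, _, h, w = src_img.shape
    cam = backproject(tgt_depth, K_inv)                  # (B, 3, H*W)
    cam = T[:, :3, :3] @ cam + T[:, :3, 3:4]             # apply rigid motion
    pix = K @ cam
    # Perspective divide; points behind the camera are not handled here.
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    gx = 2.0 * pix[:, 0] / (w - 1) - 1.0
    gy = 2.0 * pix[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(b, h, w, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def photometric_loss(tgt_img, src_img, tgt_depth, T, K, K_inv):
    """L1 photometric consistency between the target frame and the synthesized
    view; full systems add SSIM terms, validity masks, and smoothness priors."""
    warped = inverse_warp(src_img, tgt_depth, T, K, K_inv)
    return (tgt_img - warped).abs().mean()
```

Forward projection runs the other way: each pixel is scattered to its projected location in the other frame rather than sampled from it, which creates holes and collisions that must be handled explicitly, and this is the step the paper's learned module is designed to perform correctly for independently moving objects.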
| Original language | English |
| --- | --- |
| Title of host publication | 35th AAAI Conference on Artificial Intelligence, AAAI 2021 |
| Publisher | Association for the Advancement of Artificial Intelligence |
| Pages | 1863-1872 |
| Number of pages | 10 |
| ISBN (Electronic) | 9781713835974 |
| DOIs | |
| State | Published - 2021 |
| Event | 35th AAAI Conference on Artificial Intelligence, AAAI 2021 - Virtual, Online; Duration: 2 Feb 2021 → 9 Feb 2021 |
Publication series
| Name | 35th AAAI Conference on Artificial Intelligence, AAAI 2021 |
| --- | --- |
| Volume | 3A |
Conference
| Conference | 35th AAAI Conference on Artificial Intelligence, AAAI 2021 |
| --- | --- |
| City | Virtual, Online |
| Period | 2/02/21 → 9/02/21 |
Bibliographical note
Publisher Copyright: Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.