NeRF-Supervised Deep Stereo

CVPR 2023

Fabio Tosi1
Alessio Tonioni2
Daniele De Gregorio3
Matteo Poggi1

University of Bologna1
Google Inc.2

On top, predictions by RAFT-Stereo trained with our approach on user-collected images, without using any synthetic datasets, ground-truth depth or (even) real stereo pairs. At the bottom, a zoom-in on the Backpack disparity map, showing an unprecedented level of detail compared to existing ground-truth-free strategies trained on the same data.


"We introduce a novel framework for training deep stereo networks effortlessly and without any ground-truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of them, a NeRF-supervised training procedure is carried out, from which we exploit rendered stereo triplets to compensate for occlusions and depth maps as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, closing the gap with supervised models and, in most cases, outperforming them at zero-shot generalization."


1 - Training Data Generation

  • Image Collection and COLMAP Pre-processing. Acquire a set of images of a single static scene and estimate intrinsic and extrinsic camera parameters using COLMAP, the standard procedure for preparing user-collected data for Neural Radiance Fields (NeRFs).

  • NeRF Training. Train an independent NeRF for each scene by rendering color for a batch of rays from collected image positions and optimizing an L2 loss with respect to pixel colors in the collected frames. In our work, we adopt Instant-NGP as the NeRF engine.

  • Stereo Pairs Rendering. We define multiple virtual stereo cameras for each trained NeRF model. Then, we simultaneously render binocular stereo pairs at arbitrary spatial resolution while extracting disparity maps and uncertainty to train deep stereo networks. Additionally, we render a third image to the left of the reference frame of each stereo pair, which produces a perfectly rectified stereo triplet.
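The triplet-rendering step above hinges on two simple geometric facts: shifting a camera along its own x-axis while keeping its rotation fixed yields perfectly rectified views, and NeRF-rendered depth converts to disparity via \( d = f \cdot b / z \). A minimal sketch of both (the function names and the use of 4x4 camera-to-world matrices are our illustrative assumptions, not the paper's actual code):

```python
import numpy as np

def triplet_poses(c2w, baseline):
    """Given a center camera-to-world pose (4x4), return camera-to-world
    poses for a rectified triplet (left, center, right).

    Translating the origin along the camera's own x-axis while keeping the
    rotation fixed produces perfectly rectified views, so no stereo
    rectification step is needed afterwards.
    """
    right_axis = c2w[:3, 0]                 # camera x-axis in world coordinates
    left = c2w.copy()
    left[:3, 3] -= baseline * right_axis    # third view, left of the reference
    right = c2w.copy()
    right[:3, 3] += baseline * right_axis   # right view of the stereo pair
    return left, c2w, right

def depth_to_disparity(depth, focal, baseline):
    """Convert a NeRF-rendered depth map to a disparity map via d = f*b/z."""
    return focal * baseline / np.clip(depth, 1e-6, None)
```

Each triplet pose is then fed back to the trained NeRF to render the corresponding view, along with its depth (hence disparity) and uncertainty maps.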

2 - NeRF-Supervised Stereo Regime

Data generated so far is used to train stereo models. Given a rendered image triplet \((I_l, I_c, I_r)\), we estimate a disparity map by feeding the network with \((I_c, I_r)\) as the left and right views of a standard stereo pair. Then, we propose a NeRF-Supervised (NS) loss with two terms:

  • Triplet Photometric Loss. We measure the photometric difference between the warped reference image (using both the left and right images of the triplet) and \( I_c\) by adopting the Structural Similarity Index Measure (SSIM) and absolute pixel difference.

  • Rendered Disparity Loss. We further assist the photometric loss by exploiting an additional loss between the predictions and the rendered disparities by NeRF. A filtering mechanism based on the rendered uncertainty is employed to retain only the most reliable pixels.

The two terms are summed with weights balancing the impact of photometric and disparity losses. This completes our NeRF-Supervised training regime.
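The combination of the two terms can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights, the uncertainty threshold, and the assumption that per-pixel photometric error maps (SSIM + L1) have already been computed by warping \( I_l \) and \( I_r \) toward \( I_c \) are all ours:

```python
import numpy as np

def ns_loss(pred_disp, rendered_disp, uncertainty,
            photo_err_left, photo_err_right,
            alpha=1.0, beta=0.1, tau=0.5):
    """Sketch of the NeRF-Supervised loss (alpha, beta, tau are
    illustrative values, not the paper's).

    photo_err_left / photo_err_right: per-pixel photometric error maps
    obtained by warping I_l and I_r toward I_c with the predicted disparity.
    Taking the per-pixel minimum compensates for occlusions: a pixel
    occluded in one view is usually visible in the other.
    """
    # Triplet photometric term: per-pixel min over the two warps.
    photo = np.minimum(photo_err_left, photo_err_right).mean()
    # Rendered disparity term: L1 on pixels whose NeRF uncertainty is low.
    mask = uncertainty < tau
    disp = np.abs(pred_disp - rendered_disp)[mask].mean() if mask.any() else 0.0
    return alpha * photo + beta * disp
```

The min-over-warps trick is what the triplet buys us: with only a binocular pair, pixels occluded in the right view would receive a meaningless photometric error, whereas here the left rendering covers them.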

YouTube Video

Collected Dataset

We collect a total of 270 high-resolution scenes (≈8 Mpx) in both indoor and outdoor environments using standard camera-equipped smartphones. For each scene, we focus on one or more specific objects and acquire 100 images from different viewpoints, ensuring that the scene is completely static. The acquisition protocol uses either front-facing or 360° views. Here we report individual examples drawn from 30 different scenes in our dataset.

Coming Soon: Upload Your Scene!

Would you like to help expand our dataset, so as to obtain more robust and accurate stereo models in every scenario? Upload your images as a zip file, and we will take care of processing them with NeRF and retraining the stereo models.


Qualitative Results

From left to right: reference image and disparity maps obtained by training RAFT-Stereo using the popular image reconstruction loss between binocular stereo pairs \( \mathcal{L}_\rho \), the photometric loss between horizontally aligned triplets \( \mathcal{L}_{3\rho} \), disparity supervision from proxy labels extracted using the method proposed in Aleotti et al. [3], and our NeRF-Supervised loss paradigm.

Qualitative results on ETH3D. We show reference images and disparity maps predicted by RAFT-Stereo trained using our NeRF-Supervised loss paradigm.


@InProceedings{Tosi_2023_CVPR,
  author    = {Tosi, Fabio and Tonioni, Alessio and De Gregorio, Daniele and Poggi, Matteo},
  title     = {NeRF-Supervised Deep Stereo},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2023},
  pages     = {855-866}
}

* This is not an officially supported Google product.