DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Teaser figure.
Given a set of multi-view images as input, DiffusionSfM parametrizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and learns a denoising diffusion model to infer these from multi-view input. In contrast to current Structure-from-Motion pipelines, which often adopt a two-stage approach of pairwise reasoning followed by global optimization, our method unifies both stages into a single end-to-end multi-view reasoning step.
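For concreteness, the sketch below illustrates how pixel-wise ray origins and endpoints in a global frame can be derived from a posed camera and a depth map. The function name, the world-to-camera convention, and the lack of any coordinate normalization are assumptions made for illustration; this is not DiffusionSfM's exact implementation.

```python
import torch

def rays_from_camera(K, R, t, depth):
    """Compute pixel-wise ray origins and endpoints in the world frame.

    A minimal sketch: assumes world-to-camera extrinsics (R, t), i.e.
    x_cam = R @ x_world + t, and a pinhole intrinsic matrix K (3, 3).
    `depth` is an (H, W) depth map for this image.
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3)

    # Back-project pixels to camera-frame points at the given depth.
    dirs_cam = pix @ torch.linalg.inv(K).T                          # (H, W, 3)
    pts_cam = dirs_cam * depth[..., None]                           # (H, W, 3)

    # Camera center in the world frame, shared by all rays of this image.
    origins = (-R.T @ t).expand(H, W, 3)                            # (H, W, 3)
    # Ray endpoints: camera-frame points mapped to the world frame,
    # x_world = R^T x_cam - R^T t (applied row-wise via pts_cam @ R).
    endpoints = pts_cam @ R + origins

    return origins, endpoints
```

Note that the ray origin is identical for every pixel of an image (the camera center), while the endpoints are the 3D scene points observed at each pixel, so the pair jointly encodes camera pose and scene geometry.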

Abstract

Current Structure-from-Motion (SfM) methods often adopt a two-stage pipeline involving learned or geometric pairwise reasoning followed by global optimization. We instead propose a data-driven multi-view reasoning approach that directly infers cameras and 3D geometry from multi-view images. Our proposed framework, DiffusionSfM, parametrizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame, and learns a transformer-based denoising diffusion model to predict these from multi-view input. We develop mechanisms to overcome practical challenges in training diffusion models with missing data and unbounded scene coordinates, and demonstrate that DiffusionSfM enables accurate prediction of both 3D geometry and cameras. We empirically validate our approach on challenging real-world data and find that DiffusionSfM improves over prior classical and learning-based methods, while also naturally modeling uncertainty and allowing external guidance to be incorporated at inference.

Method

Given sparse multi-view images as input, DiffusionSfM predicts pixel-wise ray origins and endpoints for each image in a global frame using a denoising diffusion process. During training, the model is conditioned on a depth mask to handle the missing or invalid ground-truth depth common in real-world datasets such as CO3D. At inference, the depth mask is set to all ones, enabling the model to predict origins and endpoints for every pixel.
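The sketch below shows how such mask conditioning might look in a standard diffusion training loop and at sampling time. The model interface, tensor shapes, and the DDPM-style scheduler (`add_noise`, `step`, `timesteps`) are assumed names for illustration only and do not reproduce DiffusionSfM's actual code.

```python
import torch
import torch.nn.functional as F

def training_step(model, images, gt_rays, valid_mask, scheduler):
    """One denoising-diffusion training step with a depth-validity mask.

    Assumed interface: `model` is an x_0-prediction network taking noisy rays,
    images, the mask, and a timestep; `gt_rays` holds pixel-wise origins and
    endpoints, e.g. (B, N, 6, H, W); `valid_mask` marks pixels with valid
    ground-truth depth; `scheduler` is a DDPM-style noise scheduler.
    """
    B = gt_rays.shape[0]
    t = torch.randint(0, scheduler.num_timesteps, (B,), device=gt_rays.device)
    noise = torch.randn_like(gt_rays)
    noisy_rays = scheduler.add_noise(gt_rays, noise, t)

    pred_x0 = model(noisy_rays, images, valid_mask, t)
    # Supervise only pixels with valid ground-truth geometry.
    loss = (F.mse_loss(pred_x0, gt_rays, reduction="none") * valid_mask).sum()
    return loss / valid_mask.sum().clamp(min=1)

@torch.no_grad()
def sample(model, images, scheduler, shape):
    """Inference: the depth mask is all ones, so rays are predicted for every pixel."""
    rays = torch.randn(shape, device=images.device)
    ones_mask = torch.ones_like(rays[:, :, :1])  # depth mask set to all ones
    for t in scheduler.timesteps:
        pred_x0 = model(rays, images, ones_mask, t)
        rays = scheduler.step(pred_x0, t, rays)   # assumed scheduler update
    return rays
```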

Method figure.

Results

Quantitative Comparison of Camera Pose Accuracy on CO3D

On the left, we report the proportion of relative camera rotations within 15° of the ground truth. On the right, we report the proportion of camera centers within 10% of the scene scale. To align the predicted camera centers to the ground truth, we apply an optimal similarity transform; the alignment is therefore trivially perfect at N=2 but degrades as more images are added. DiffusionSfM outperforms all other methods on camera center accuracy, and outperforms all methods trained on equivalent data on rotation accuracy.
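As a reference for this metric, the sketch below shows how an optimal similarity transform (Umeyama alignment) can be applied before measuring camera-center accuracy. The function names and the particular scene-scale definition are illustrative assumptions, not the paper's exact evaluation code.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Optimal similarity transform (scale, rotation, translation) mapping
    src -> dst in the least-squares sense (Umeyama, 1991). src, dst: (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / (xs ** 2).sum(axis=1).mean()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def camera_center_accuracy(pred_centers, gt_centers, thresh=0.1):
    """Fraction of aligned predicted centers within `thresh` of the scene scale
    (here assumed to be the largest distance from the ground-truth centroid)."""
    s, R, t = umeyama_alignment(pred_centers, gt_centers)
    aligned = (s * (R @ pred_centers.T)).T + t
    scene_scale = np.linalg.norm(gt_centers - gt_centers.mean(0), axis=1).max()
    err = np.linalg.norm(aligned - gt_centers, axis=1)
    return (err < thresh * scene_scale).mean()
```

With only two cameras, the similarity transform can map the two predicted centers exactly onto the two ground-truth centers, which is why accuracy is perfect at N=2 by construction.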

Camera pose accuracy comparison figure.

Quantitative Comparison of Predicted Geometry and Focal Length on CO3D Unseen Categories

Top: Chamfer Distance (CD) computed over all scene points. Middle: CD computed on foreground points only. Bottom: percentage of predicted focal lengths within 15% of the ground truth. RD+MoGe denotes Ray Diffusion camera poses combined with MoGe depth estimates aligned to the ground truth. DUSt3R-CO3D is trained solely on CO3D, while DUSt3R-all is trained on multiple datasets. DiffusionSfM outperforms all methods on full-scene geometry and estimated focal length, and also outperforms both RD+MoGe and DUSt3R-CO3D on foreground geometry.
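For reference, a minimal sketch of the Chamfer Distance and focal-length metrics reported here; details such as point sampling, alignment, and normalization follow the paper and are not reproduced in this illustration.

```python
import torch

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between two point sets of shape (P, 3) and (Q, 3)."""
    d = torch.cdist(pred, gt)                # (P, Q) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def focal_within_threshold(pred_f, gt_f, thresh=0.15):
    """Whether the predicted focal length is within 15% of the ground truth."""
    return abs(pred_f - gt_f) / gt_f < thresh
```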

Geometry and focal length comparison figure.

Qualitative Comparison of Predicted Geometry and Camera Poses

DiffusionSfM handles challenging samples more robustly, e.g., the skateboard and tennis ball. Additionally, while DUSt3R-all can predict highly accurate camera rotations, it often struggles with camera centers (see the keyboard and backpack examples).

Qualitative comparison figures.

Visualizations of the Effect of Mono-Depth Diffusion Guidance

We utilize mono-depth estimates from MoGe to guide the x_0-prediction of our model toward more accurate, cleaner estimates. This guidance enhances the quality of the predicted geometry while preserving multi-view consistency.
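A simplified sketch of how such guidance can be applied during sampling is shown below: the mono-depth map is first aligned to the current x_0-predicted depth with a per-image scale and shift, and the predicted endpoints are then nudged along their rays toward the aligned depth. The function name, blending weight, and alignment scheme are assumptions for illustration rather than the exact guidance used by DiffusionSfM.

```python
import torch

def guide_with_monodepth(pred_x0_endpoints, origins, mono_depth, weight=0.5):
    """Nudge x_0-predicted endpoints (H, W, 3) toward a mono-depth map (H, W).

    A simplified sketch: mono-depth (e.g., from MoGe) is aligned with a
    least-squares scale and shift before blending along each ray.
    """
    # Per-pixel distance along the ray implied by the current prediction.
    ray_depth = (pred_x0_endpoints - origins).norm(dim=-1)              # (H, W)

    # Least-squares scale/shift aligning mono-depth to the predicted depth.
    A = torch.stack([mono_depth.flatten(),
                     torch.ones_like(mono_depth.flatten())], dim=-1)    # (HW, 2)
    sol = torch.linalg.lstsq(A, ray_depth.flatten().unsqueeze(-1)).solution
    aligned_depth = mono_depth * sol[0] + sol[1]

    # Move endpoints along their viewing rays toward the aligned mono-depth.
    dirs = (pred_x0_endpoints - origins) / ray_depth[..., None].clamp(min=1e-6)
    target = origins + dirs * aligned_depth[..., None]
    return pred_x0_endpoints + weight * (target - pred_x0_endpoints)
```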

Mono-depth diffusion guidance visualization figure.