Tjark Behrens¹, Anton Obukhov³, Bingxin Ke¹, Fabio Tosi², Matteo Poggi², Konrad Schindler¹

¹ETH Zurich | ²University of Bologna | ³Huawei Bayer Lab
CVPR 2026 Findings
This repository is the official implementation of the paper titled "StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space" (accepted at CVPR 2026 Findings).
Create and activate the environment:

    git clone https://github.com/prs-eth/stereospace.git
    cd stereospace
    python -m venv ~/venv_stereospace
    source ~/venv_stereospace/bin/activate
    pip install -r requirements.txt

Then run inference:

    python inference.py

This will:
- ⬇️ Download the necessary checkpoints. If you are prompted to log in, please provide a read access token from Hugging Face → Settings → Access Tokens. When asked 'Add token as git credential? (Y/n)', select 'n'.
- 👀 Create stereo from input images; if `--input` is not specified, the `example_images` directory is used.
- 💾 Save predictions to an output folder.
You can also pass the following arguments:

- `--input INPUT`: Input image or a directory path, default `./example_images`;
- `--output OUTPUT`: Output directory, default `./outputs`;
- `--baseline BASELINE`: Baseline, default `0.15` (15 cm);
- `--batch_size BATCH_SIZE`: Batch size when processing a folder of images, default is 1;
- `--src_intrinsics`, `--tgt_intrinsics`: Camera intrinsics for precise control of the FOV, default is a standard camera.
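
The link between camera intrinsics and FOV can be illustrated with a minimal sketch. It assumes a pinhole camera with square pixels and a centered principal point; `intrinsics_from_fov` is a hypothetical helper, not part of this repository, and the format in which `--src_intrinsics` and `--tgt_intrinsics` expect their values is not covered here.

```python
import math

import numpy as np


def intrinsics_from_fov(width, height, hfov_deg):
    """Return a 3x3 pinhole intrinsics matrix for a given horizontal FOV."""
    # Focal length in pixels, derived from the horizontal field of view.
    fx = 0.5 * width / math.tan(math.radians(hfov_deg) / 2.0)
    fy = fx  # assumption: square pixels
    cx, cy = width / 2.0, height / 2.0  # assumption: centered principal point
    return np.array(
        [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]], dtype=np.float32
    )


# A 90-degree horizontal FOV on a 768-pixel-wide image gives fx of about 384.
K = intrinsics_from_fov(768, 768, 90.0)
print(K)
```

A narrower FOV yields a larger focal length in pixels, and a wider FOV a smaller one.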
We train on the datasets referenced in our paper. Please obtain the raw data from the original dataset providers and follow their respective licenses/terms.
To simplify training across multiple sources, we convert each dataset into a common, flat directory structure where each stereo sample is stored as a single .npz file containing:
- left / right image
- left-to-right / right-to-left disparity
- camera intrinsics
- stereo baseline
Each .npz contains the following keys:
| Key | Type / shape | Description |
|---|---|---|
| `left` | `uint8`, `(H, W, 3)` | Left RGB image |
| `right` | `uint8`, `(H, W, 3)` | Right RGB image |
| `disp_l2r` | `float32`, `(H, W)` | Disparity map (left → right). Optional. |
| `disp_r2l` | `float32`, `(H, W)` | Disparity map (right → left). Optional. |
| `intrinsics` | `float32`, `(3, 3)` | Camera intrinsics matrix |
| `baseline` | `float32` or `(1,)` | Stereo baseline (same units as disparities are derived from) |
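
As a rough illustration of the conversion step, the sketch below writes one sample with the keys listed in the table. The helper name `save_stereo_sample`, the file name, and the dummy arrays are assumptions made for this example; the repository's own conversion scripts may differ (for instance in whether files are compressed).

```python
import os
import tempfile

import numpy as np


def save_stereo_sample(path, left, right, intrinsics, baseline,
                       disp_l2r=None, disp_r2l=None):
    """Write one stereo sample to a .npz file using the keys from the table."""
    left = np.asarray(left, dtype=np.uint8)
    right = np.asarray(right, dtype=np.uint8)
    if left.ndim != 3 or left.shape[2] != 3 or left.shape != right.shape:
        raise ValueError("left/right must both have shape (H, W, 3)")
    sample = {
        "left": left,
        "right": right,
        "intrinsics": np.asarray(intrinsics, dtype=np.float32).reshape(3, 3),
        "baseline": np.float32(baseline),
    }
    # Disparities are optional; only store the directions that exist.
    for key, disp in (("disp_l2r", disp_l2r), ("disp_r2l", disp_r2l)):
        if disp is not None:
            disp = np.asarray(disp, dtype=np.float32)
            if disp.shape != left.shape[:2]:
                raise ValueError(f"{key} must have shape (H, W)")
            sample[key] = disp
    np.savez(path, **sample)


# Dummy data standing in for a real dataset sample (hypothetical values).
h, w = 4, 6
out_path = os.path.join(tempfile.mkdtemp(), "sample_000000.npz")
save_stereo_sample(
    out_path,
    left=np.zeros((h, w, 3), dtype=np.uint8),
    right=np.zeros((h, w, 3), dtype=np.uint8),
    intrinsics=np.eye(3),
    baseline=0.15,
    disp_l2r=np.ones((h, w)),
)
with np.load(out_path) as data:
    print(sorted(data.files))
```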
Notes
- Some datasets may provide only one disparity direction. In that case the missing key can be omitted; training will treat it as unavailable.
- Sources: UnrealStereo, Sintel, PLTD3, TartanAir, SpringStereo, Vkitti2, FATStereo, SimStereo, Infinigen, IRSStereo, DynamicReplica, LayeredFlow, NerfStereo, SceneSplat (Hypersim, Replica, ScanNet).
- If you store the data in a different location, please specify so in `train.yaml`: `data.data_dir: "$CUSTOM_PATH"`
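
The optional-key convention from the notes can be sketched as follows. This is a minimal reader, assuming that a missing disparity direction is represented as `None`; `load_stereo_sample` is a hypothetical helper, and the data loader used by `training.py` may handle this differently.

```python
import os
import tempfile

import numpy as np

REQUIRED_KEYS = ("left", "right", "intrinsics", "baseline")
OPTIONAL_KEYS = ("disp_l2r", "disp_r2l")


def load_stereo_sample(path):
    """Load one .npz sample; absent disparity directions become None."""
    with np.load(path) as data:
        missing = [k for k in REQUIRED_KEYS if k not in data.files]
        if missing:
            raise KeyError(f"{path} is missing required keys: {missing}")
        sample = {k: data[k] for k in REQUIRED_KEYS}
        for k in OPTIONAL_KEYS:
            sample[k] = data[k] if k in data.files else None
    # The baseline may be stored as a scalar or with shape (1,); normalize it.
    sample["baseline"] = float(np.asarray(sample["baseline"]).reshape(-1)[0])
    return sample


# Demo: a sample that only provides the left-to-right disparity.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(
    demo_path,
    left=np.zeros((4, 6, 3), dtype=np.uint8),
    right=np.zeros((4, 6, 3), dtype=np.uint8),
    disp_l2r=np.ones((4, 6), dtype=np.float32),
    intrinsics=np.eye(3, dtype=np.float32),
    baseline=np.array([0.15], dtype=np.float32),
)
sample = load_stereo_sample(demo_path)
print(sample["disp_r2l"] is None, sample["baseline"])
```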
Download the pre-trained Stable Diffusion (v2, 768x768) checkpoints and place them inside `weights/stable-diffusion-2`. Download the CLIP ViT-H/14 - LAION-2B text encoder and place it inside the `stable-diffusion-2` subfolder.
This repo supports three launch modes.
1) Single GPU

    python training.py --config configs/train.yaml

2) Single node, multi-GPU (Accelerate)

Use Hugging Face Accelerate to launch multi-process training on one machine.

    accelerate launch \
      --num_processes=$GPUS \
      training.py --config configs/train.yaml

3) Multi-node, multi-GPU (torchrun)

For distributed training across multiple machines, use PyTorch `torchrun`. Set the environment variables according to the available hardware; they can usually be derived from the scheduler in use.
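
As one possible way to derive these values, the sketch below assumes the scheduler is SLURM and reads its standard environment variables (`SLURM_NNODES`, `SLURM_NODEID`, `SLURM_GPUS_ON_NODE`, `SLURM_JOB_ID`). The helper `build_torchrun_command`, the default port, and the host name in the demo are hypothetical; whether all of these variables are set depends on how the job was submitted, and other schedulers expose different variables.

```python
import shlex


def build_torchrun_command(env, rdzv_host, rdzv_port=29500,
                           config="configs/train.yaml"):
    """Assemble a torchrun command from SLURM-style environment variables."""
    nnodes = int(env["SLURM_NNODES"])               # number of nodes in the job
    node_rank = int(env["SLURM_NODEID"])            # index of this node
    gpus_per_node = int(env["SLURM_GPUS_ON_NODE"])  # GPUs allocated on this node
    rdzv_id = env["SLURM_JOB_ID"]                   # shared by all nodes of the job
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_host}:{rdzv_port}",
        f"--rdzv_id={rdzv_id}",
        "training.py",
        "--config",
        config,
    ]


# Demo with made-up values; on a cluster, pass os.environ instead.
fake_env = {
    "SLURM_NNODES": "2",
    "SLURM_NODEID": "1",
    "SLURM_GPUS_ON_NODE": "4",
    "SLURM_JOB_ID": "12345",
}
cmd = build_torchrun_command(fake_env, rdzv_host="node001")
print(shlex.join(cmd))
```

The rendezvous host is passed in explicitly because it must be a node that every participant can reach, typically the first node of the allocation.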

    torchrun \
      --nnodes=$NNODES \
      --nproc_per_node=$GPUS_PER_NODE \
      --node_rank=$NODE_RANK \
      --rdzv_backend=c10d \
      --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
      --rdzv_id="$RDZV_ID" \
      training.py --config configs/train.yaml

| Problem | Solution |
|---|---|
| (pip) Errors installing requirements via `pip install -r requirements.txt` | `python -m pip install --upgrade pip` |

Please cite our paper:

    @misc{behrens2025stereospace,
      title        = {StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space},
      author       = {Tjark Behrens and Anton Obukhov and Bingxin Ke and Fabio Tosi and Matteo Poggi and Konrad Schindler},
      year         = {2025},
      eprint       = {2512.10959},
      archivePrefix= {arXiv},
      primaryClass = {cs.CV},
      url          = {https://arxiv.org/abs/2512.10959},
    }

The code and models of this work are licensed under the MIT License. By downloading and using the code and model you agree to the terms in LICENSE.

