StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Tjark Behrens¹, Anton Obukhov³, Bingxin Ke¹, Fabio Tosi², Matteo Poggi², Konrad Schindler¹
¹ETH Zurich | ²University of Bologna | ³Huawei Bayer Lab

CVPR 2026 Findings

This repository is the official implementation of the paper titled "StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space" (accepted at CVPR 2026 Findings).

Quick Start

Environment & Requirements

Create and activate the environment:

git clone https://github.com/prs-eth/stereospace.git
cd stereospace
python -m venv ~/venv_stereospace
source ~/venv_stereospace/bin/activate
pip install -r requirements.txt

Inference

python inference.py

This will:

⬇️ Download the necessary checkpoints. If you are prompted to log in, please provide a read access token from Hugging Face → Settings → Access Tokens. When asked 'Add token as git credential? (Y/n)', select 'n'.
👀 Create stereo from input images; without specifying --input, it will use the example_images directory.
💾 Save predictions to an output folder.

You can also pass the following arguments:

--input INPUT: Input image or a directory path, default ./example_images;
--output OUTPUT: Output directory, default ./outputs;
--baseline BASELINE: Stereo baseline in meters, default 0.15 (15 cm);
--batch_size BATCH_SIZE: Batch size when processing a folder of images, default 1;
--src_intrinsics, --tgt_intrinsics: Camera intrinsics for precise control of the FOV, default is a standard camera.
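
To render the same images at several baselines, the flags above can be driven from a small script. The following is a minimal sketch, not part of the repository: the helper names and the per-baseline output folder naming are hypothetical, and only the --input, --output, --baseline and --batch_size flags listed above are used. With dry_run=True it only assembles the commands without running them.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(input_path, output_dir, baseline=0.15, batch_size=1):
    """Assemble the argument list for one inference.py run."""
    return [
        sys.executable, "inference.py",
        "--input", str(input_path),
        "--output", str(output_dir),
        "--baseline", str(baseline),
        "--batch_size", str(batch_size),
    ]


def sweep_baselines(input_path, output_root, baselines, dry_run=True):
    """Build (and optionally run) one inference call per baseline value."""
    commands = []
    for baseline in baselines:
        # Hypothetical naming scheme: one output folder per baseline.
        output_dir = Path(output_root) / f"baseline_{baseline:.2f}"
        command = build_inference_command(input_path, output_dir, baseline)
        commands.append(command)
        if not dry_run:
            # check=True raises if inference.py exits with an error.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in sweep_baselines("./example_images", "./outputs", [0.05, 0.15, 0.30]):
        print(" ".join(command))
```

Pass dry_run=False from the repository root (with the environment activated) to actually execute the runs.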

Training

Data

We train on the datasets referenced in our paper. Please obtain the raw data from the original dataset providers and follow their respective licenses/terms.

To simplify training across multiple sources, we convert each dataset into a common, flat directory structure where each stereo sample is stored as a single .npz file containing:

left / right image
left-to-right / right-to-left disparity
camera intrinsics
stereo baseline

Each .npz contains the following keys:

Key	Type / shape	Description
`left`	`uint8`, `(H, W, 3)`	Left RGB image
`right`	`uint8`, `(H, W, 3)`	Right RGB image
`disp_l2r`	`float32`, `(H, W)`	Disparity map (left → right). Optional.
`disp_r2l`	`float32`, `(H, W)`	Disparity map (right → left). Optional.
`intrinsics`	`float32`, `(3, 3)`	Camera intrinsics matrix
`baseline`	`float32`, scalar or `(1,)`	Stereo baseline (in the same units used to derive the disparities)
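
As an illustration of the layout in the table, the sketch below builds one synthetic sample, checks it against the listed dtypes and shapes, and writes it with NumPy. The helper names, the image size, the placeholder intrinsics and the file name are assumptions made for the example; the repository's own conversion scripts may differ.

```python
import tempfile
from pathlib import Path

import numpy as np


def make_sample(height=4, width=6, with_disparity=True):
    """Build one synthetic stereo sample using the keys from the table."""
    rng = np.random.default_rng(0)
    sample = {
        "left": rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8),
        "right": rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8),
        # Placeholder intrinsics: focal length and principal point are made up.
        "intrinsics": np.array(
            [[100.0, 0.0, width / 2], [0.0, 100.0, height / 2], [0.0, 0.0, 1.0]],
            dtype=np.float32,
        ),
        "baseline": np.float32(0.15),
    }
    if with_disparity:
        sample["disp_l2r"] = np.ones((height, width), dtype=np.float32)
        sample["disp_r2l"] = np.ones((height, width), dtype=np.float32)
    return sample


def validate_sample(sample):
    """Check dtypes and shapes against the table; disparities are optional."""
    height, width, channels = sample["left"].shape
    assert channels == 3 and sample["left"].dtype == np.uint8
    assert sample["right"].shape == (height, width, 3)
    assert sample["right"].dtype == np.uint8
    assert sample["intrinsics"].shape == (3, 3)
    assert sample["intrinsics"].dtype == np.float32
    assert np.asarray(sample["baseline"]).size == 1
    for key in ("disp_l2r", "disp_r2l"):
        if key in sample:
            assert sample[key].shape == (height, width)
            assert sample[key].dtype == np.float32


if __name__ == "__main__":
    sample = make_sample()
    validate_sample(sample)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical file name; one .npz file per stereo sample.
        out_path = Path(tmp_dir) / "sample_000000.npz"
        np.savez(out_path, **sample)
        print(sorted(np.load(out_path).files))
```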

Notes

Some datasets may provide only one disparity direction. In that case the missing key can be omitted; training will treat it as unavailable.
Sources: UnrealStereo, Sintel, PLTD3, TartanAir, SpringStereo, Vkitti2, FATStereo, SimStereo, Infinigen, IRSStereo, DynamicReplica, LayeredFlow, NerfStereo, SceneSplat (Hypersim, Replica, ScanNet).
If you store the data in a different location, set data.data_dir: "$CUSTOM_PATH" in configs/train.yaml.
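
The notes above imply that a reader of these files has to tolerate a missing disparity direction. A minimal sketch of such a loader follows; it is not the repository's data loader. The function names are hypothetical, and raising an error when one of the other four keys is missing is an assumption of this example.

```python
import tempfile
from pathlib import Path

import numpy as np

REQUIRED_KEYS = ("left", "right", "intrinsics", "baseline")
OPTIONAL_KEYS = ("disp_l2r", "disp_r2l")


def load_sample(path):
    """Load one .npz sample; absent disparities are returned as None."""
    with np.load(path) as archive:
        missing = [key for key in REQUIRED_KEYS if key not in archive.files]
        if missing:
            raise KeyError(f"{path} is missing required keys: {missing}")
        sample = {key: archive[key] for key in REQUIRED_KEYS}
        for key in OPTIONAL_KEYS:
            sample[key] = archive[key] if key in archive.files else None
    # Accept both a scalar and a (1,) array for the baseline.
    sample["baseline"] = float(np.asarray(sample["baseline"]).reshape(-1)[0])
    return sample


def list_samples(data_dir):
    """Return all .npz files in the flat data directory, sorted by name."""
    return sorted(Path(data_dir).glob("*.npz"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write a sample that only provides the left-to-right disparity.
        np.savez(
            Path(tmp_dir) / "sample_000000.npz",
            left=np.zeros((2, 3, 3), dtype=np.uint8),
            right=np.zeros((2, 3, 3), dtype=np.uint8),
            disp_l2r=np.ones((2, 3), dtype=np.float32),
            intrinsics=np.eye(3, dtype=np.float32),
            baseline=np.array([0.15], dtype=np.float32),
        )
        for path in list_samples(tmp_dir):
            sample = load_sample(path)
            print(path.name, "disp_r2l available:", sample["disp_r2l"] is not None)
```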

Checkpoints

Download the pre-trained Stable Diffusion (v2, 768x768) checkpoints and place them inside weights/stable-diffusion-2. Download the CLIP ViT-H/14 - LAION-2B text encoder and place it inside the stable-diffusion-2 subfolder.

Running Training Pipeline

This repo supports three launch modes.

1) Single GPU

python training.py --config configs/train.yaml

2) Single node, multi-GPU (Accelerate)

Use Hugging Face Accelerate to launch multi-process training on one machine.

accelerate launch \
  --num_processes=$GPUS \
  training.py --config configs/train.yaml

3) Multi-node, multi-GPU (torchrun)

For distributed training across multiple machines, use PyTorch torchrun. Set the environment variables according to the available hardware; their values can usually be derived from your job scheduler.

torchrun \
  --nnodes=$NNODES \
  --nproc_per_node=$GPUS_PER_NODE \
  --node_rank=$NODE_RANK \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
  --rdzv_id="$RDZV_ID" \
  training.py --config configs/train.yaml
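
As one way to derive these values from a scheduler, the sketch below maps SLURM job variables onto the torchrun flags used above. SLURM is an assumption here, since the README does not name a scheduler. The function name is hypothetical, the rendezvous host must be supplied by the caller (for example the first node of the allocation), and port 29500 is only a placeholder for any free port.

```python
import os


def torchrun_args_from_slurm(env, rdzv_host, rdzv_port=29500):
    """Translate SLURM job variables into the torchrun flags used above."""
    return [
        "torchrun",
        f"--nnodes={env['SLURM_NNODES']}",
        f"--nproc_per_node={env['SLURM_GPUS_ON_NODE']}",
        f"--node_rank={env['SLURM_NODEID']}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_host}:{rdzv_port}",
        # Reusing the job ID keeps the rendezvous ID identical on all nodes.
        f"--rdzv_id={env['SLURM_JOB_ID']}",
        "training.py", "--config", "configs/train.yaml",
    ]


if __name__ == "__main__":
    # Fall back to example values when not running inside a SLURM job.
    example_env = {
        "SLURM_NNODES": os.environ.get("SLURM_NNODES", "2"),
        "SLURM_GPUS_ON_NODE": os.environ.get("SLURM_GPUS_ON_NODE", "4"),
        "SLURM_NODEID": os.environ.get("SLURM_NODEID", "0"),
        "SLURM_JOB_ID": os.environ.get("SLURM_JOB_ID", "12345"),
    }
    print(" ".join(torchrun_args_from_slurm(example_env, rdzv_host="node001")))
```

Each node runs the resulting command once; only the --node_rank value differs between nodes.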

Troubleshooting

Problem	Solution
(pip) Errors installing requirements via `pip install -r requirements.txt`	`python -m pip install --upgrade pip`

Citation

Please cite our paper:

@misc{behrens2025stereospace,
  title        = {StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space},
  author       = {Tjark Behrens and Anton Obukhov and Bingxin Ke and Fabio Tosi and Matteo Poggi and Konrad Schindler},
  year         = {2025},
  eprint       = {2512.10959},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2512.10959},
}

License

The code and models of this work are licensed under the MIT License. By downloading and using the code and model you agree to the terms in LICENSE.
