perf: cap planner budget when model dwarfs the streaming budget by fszontagh · Pull Request #1612 · leejet/stable-diffusion.cpp

fszontagh · 2026-06-06T16:11:37Z

Summary

When the model is much larger than --max-vram, the planner currently merges the base segments into one or two huge segments. The subsequent worst_merged_segment_footprint reservation in compute_streaming_segments then leaves almost nothing for chunk-K residency, so on a model like Z-Image bf16 (11.7 GB on a 12 GB GPU) chunk-K stays near 0 and every sampling step re-uploads the full model.

When the model fits comfortably in the budget, the current behaviour is already optimal: one big merged segment, chunk-K covers the whole model, and there is no per-step host-to-device (H2D) upload.

This PR caps the budget passed to resolve_plan at a quarter of the streaming budget when total_params_bytes > 0.75 * effective_budget, otherwise it passes the full budget. Smaller merged segments shrink the worst_merged_segment_footprint reservation downstream, which frees enough residency budget for a meaningful chunk-K. Small/quantized models are unaffected.
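The rule above can be sketched as a small pure function. This is only an illustration: the helper name planner_budget and its signature are hypothetical and not taken from the patch, and the sizes in the checks are made-up round numbers rather than measurements.

```cpp
#include <cstdint>

// Hypothetical helper (name and signature are illustrative, not from the PR).
// Returns the budget that would be handed to the planner.
constexpr std::uint64_t planner_budget(std::uint64_t total_params_bytes,
                                       std::uint64_t effective_budget) {
    // If the weights exceed 75% of the streaming budget, pass only a quarter
    // of the budget; otherwise pass the full budget unchanged.
    return static_cast<double>(total_params_bytes) >
                   0.75 * static_cast<double>(effective_budget)
               ? effective_budget / 4
               : effective_budget;
}

constexpr std::uint64_t GiB = 1024ull * 1024ull * 1024ull;

// Illustrative sizes only: 11 GiB of weights against a 12 GiB budget is
// above the 9 GiB threshold, so the planner sees 3 GiB.
static_assert(planner_budget(11 * GiB, 12 * GiB) == 3 * GiB, "capped path");

// 5 GiB of weights against the same budget stays on the full-budget path.
static_assert(planner_budget(5 * GiB, 12 * GiB) == 12 * GiB, "full-budget path");
```

The comparison is strict, so a model sitting exactly on the 75% threshold still gets the full budget in this sketch.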

Quantization-aware: total_params_bytes is summed via ggml_nbytes, so a Q8/Q4 model is correctly identified as small relative to the budget.
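To illustrate why summing stored bytes makes the check quantization-aware, here is a minimal sketch. It does not call ggml: tensor_bytes is a hypothetical stand-in for a byte-size query on a flat tensor, the 6-billion-element count and 12 GB budget are made-up, and the block layouts assume the usual ggml formats (Q8_0 as 34 bytes per 32 elements, Q4_0 as 18 bytes per 32 elements).

```cpp
#include <cstdint>

// Storage layout of a block-quantized type: elements per block, bytes per block.
struct BlockFormat {
    std::uint64_t elems_per_block;
    std::uint64_t bytes_per_block;
};

constexpr BlockFormat kBf16{1, 2};    // 2 bytes per element
constexpr BlockFormat kQ8_0{32, 34};  // assumed: fp16 scale + 32 int8 values
constexpr BlockFormat kQ4_0{32, 18};  // assumed: fp16 scale + 32 4-bit values

// Hypothetical stand-in for a tensor byte-size query; n_elems is assumed
// to be a multiple of the block size.
constexpr std::uint64_t tensor_bytes(std::uint64_t n_elems, BlockFormat f) {
    return n_elems / f.elems_per_block * f.bytes_per_block;
}

constexpr std::uint64_t kElems  = 6000000000ull;   // made-up parameter count
constexpr std::uint64_t kBudget = 12000000000ull;  // made-up streaming budget

// Same parameter count, very different stored sizes.
static_assert(tensor_bytes(kElems, kBf16) == 12000000000ull, "bf16 size");
static_assert(tensor_bytes(kElems, kQ8_0) == 6375000000ull, "Q8_0 size");
static_assert(tensor_bytes(kElems, kQ4_0) == 3375000000ull, "Q4_0 size");

// Only the bf16 variant crosses the 75% threshold (9,000,000,000 bytes here).
static_assert(tensor_bytes(kElems, kBf16) > kBudget / 4 * 3, "bf16 is capped");
static_assert(tensor_bytes(kElems, kQ8_0) <= kBudget / 4 * 3, "Q8_0 is not capped");
```

A check based on parameter count alone would treat all three variants identically; a byte-based total separates them.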

Related Issue / Discussion

Continuation of the streaming-budget series #1576, #1598, #1601, #1611.

Additional Information

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

Workload	Before	After
SDXL bf16 1152x896 batch=2 8 steps	21.6 s	20.9 s
Z-Image bf16 1024x688 batch=2 9 steps	138 s	98 s

SDXL stays in the full-budget path, so the plan is unchanged (1 merged segment, whole UNet resident). Z-Image takes the T/4 path, where T is the streaming budget: the planner produces 9-11 merged segments instead of 2, and chunk-K grows from ~0 to ~3.5 GB, cutting per-step H2D traffic enough for a ~30% wall-clock win.

We tested T/2, T/3, T/5, T/6, T/8 on the Z-Image workload; T/4 is the empirical optimum on the test hardware (smaller fractions trade chunk-K headroom for per-dispatch overhead and regress).

Checklist

I have read and confirmed this PR follows the contribution guidelines.


fszontagh added 3 commits June 6, 2026 17:16

perf: cap planner budget when model dwarfs the streaming budget

2c830f5

Merge remote-tracking branch 'upstream/master' into perf/smaller-merg…

f429272

…ed-segments

Merge remote-tracking branch 'upstream/master' into perf/smaller-merg…

33a84ba

…ed-segments

fszontagh wants to merge 3 commits into leejet:master from fszontagh:perf/smaller-merged-segments
