perf: cap planner budget when model dwarfs the streaming budget #1612

Open
fszontagh wants to merge 3 commits into leejet:master from fszontagh:perf/smaller-merged-segments

@fszontagh
Contributor

Summary

When the model is much bigger than --max-vram, the planner currently merges the base segments into 1-2 huge merged segments. The follow-up worst_merged_segment_footprint reservation in compute_streaming_segments then leaves almost nothing for chunk-K residency, so on a model like Z-Image bf16 (11.7 GB on a 12 GB GPU) chunk-K stays near 0 and every sampling step re-uploads the full model.

When the model fits comfortably in the budget the current behaviour is already optimal: one big merged segment, chunk-K covers the whole model, no per-step H2D.

This PR caps the budget passed to resolve_plan at a quarter of the streaming budget when total_params_bytes > 0.75 * effective_budget, otherwise it passes the full budget. Smaller merged segments shrink the worst_merged_segment_footprint reservation downstream, which frees enough residency budget for a meaningful chunk-K. Small/quantized models are unaffected.
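The rule above can be sketched as a small pure function. This is a minimal illustration, not the PR's actual code: `planner_budget` and `kGiB` are hypothetical names, the sizes in the checks are made-up round numbers, and the threshold is written in integer form (`4 * total > 3 * budget`), which is mathematically the same as `total > 0.75 * budget` but may differ from the real comparison in rounding.

```cpp
#include <cstdint>

// Hypothetical constant for readable sizes in the checks below.
constexpr std::uint64_t kGiB = 1ull << 30;

// Hypothetical helper (not from the PR): picks the budget handed to the planner.
// If the model is larger than 3/4 of the effective streaming budget, plan with a
// quarter of that budget so merged segments stay small; otherwise use all of it.
constexpr std::uint64_t planner_budget(std::uint64_t total_params_bytes,
                                       std::uint64_t effective_budget) {
    // Integer form of: total_params_bytes > 0.75 * effective_budget
    return (4 * total_params_bytes > 3 * effective_budget)
               ? effective_budget / 4
               : effective_budget;
}

// Illustrative sizes only; these are not measurements from the PR.
static_assert(planner_budget(11 * kGiB, 12 * kGiB) == 3 * kGiB,
              "oversized model: planner sees a quarter of the budget");
static_assert(planner_budget(5 * kGiB, 12 * kGiB) == 12 * kGiB,
              "small model: planner sees the full budget");
static_assert(planner_budget(9 * kGiB, 12 * kGiB) == 12 * kGiB,
              "exactly 75% does not trigger the cap (strict comparison)");
```

Keeping the decision in one pure function makes the threshold and the fraction easy to change if other hardware favours a different split.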

Quantization-aware: total_params_bytes is summed via ggml_nbytes, so a Q8/Q4 model is correctly identified as small relative to the budget.
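To see why summing stored bytes rather than parameter counts matters, here is a rough sketch of how the same number of parameters maps to very different byte totals. It does not call ggml. `BlockLayout` and `stored_bytes` are hypothetical stand-ins for what `ggml_nbytes` reports per tensor, the 6-billion-element count is an invented example, and the block sizes assume the usual ggml layouts (Q8_0 and Q4_0 pack 32 elements plus one fp16 scale per block).

```cpp
#include <cstdint>

// Hypothetical stand-in for a tensor type's storage layout.
struct BlockLayout {
    std::uint64_t elems_per_block;
    std::uint64_t bytes_per_block;
};

constexpr BlockLayout kBF16{1, 2};    // 2 bytes per element
constexpr BlockLayout kQ8_0{32, 34};  // 32 int8 values + 2-byte scale
constexpr BlockLayout kQ4_0{32, 18};  // 32 4-bit values + 2-byte scale

// Bytes needed to store n_elements; assumes n_elements is a multiple of the block size.
constexpr std::uint64_t stored_bytes(std::uint64_t n_elements, BlockLayout layout) {
    return n_elements / layout.elems_per_block * layout.bytes_per_block;
}

// Invented example: 6 billion parameters stored three ways.
constexpr std::uint64_t kParams = 6000000000ull;
static_assert(stored_bytes(kParams, kBF16) == 12000000000ull, "bf16: 12.0 GB");
static_assert(stored_bytes(kParams, kQ8_0) == 6375000000ull, "Q8_0: ~6.4 GB");
static_assert(stored_bytes(kParams, kQ4_0) == 3375000000ull, "Q4_0: ~3.4 GB");
```

Under these assumptions the bf16 copy would be far above 75% of a 12 GB-class budget, while the Q8_0 and Q4_0 copies would stay below it and keep the full-budget path.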

Related Issue / Discussion

Continuation of the streaming-budget series #1576, #1598, #1601, #1611.

Additional Information

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

| Workload | Before | After |
| --- | --- | --- |
| SDXL bf16, 1152x896, batch=2, 8 steps | 21.6 s | 20.9 s |
| Z-Image bf16, 1024x688, batch=2, 9 steps | 138 s | 98 s |

SDXL stays on the full-budget path, so its plan is unchanged (1 merged segment, whole UNet resident). Z-Image takes the T/4 path, where T is the streaming budget: the planner produces 9-11 merged segments instead of 2, and chunk-K grows from ~0 to ~3.5 GB, which cuts per-step H2D traffic enough for a ~30% wall-clock win (138 s to 98 s).

We tested T/2, T/3, T/5, T/6, T/8 on the Z-Image workload; T/4 is the empirical optimum on the test hardware (smaller fractions trade chunk-K headroom for per-dispatch overhead and regress).
