perf: cap planner budget when model dwarfs the streaming budget #1612
Open
fszontagh wants to merge 3 commits into
Summary
When the model is much bigger than `--max-vram`, the planner currently merges the base segments into 1-2 huge merged segments. The follow-up `worst_merged_segment_footprint` reservation in `compute_streaming_segments` then leaves almost nothing for chunk-K residency, so on a model like Z-Image bf16 (11.7 GB on a 12 GB GPU) chunk-K stays near 0 and every sampling step re-uploads the full model.
When the model fits comfortably in the budget, the current behaviour is already optimal: one big merged segment, chunk-K covers the whole model, no per-step H2D.
This PR caps the budget passed to `resolve_plan` at a quarter of the streaming budget when `total_params_bytes > 0.75 * effective_budget`; otherwise it passes the full budget. Smaller merged segments shrink the `worst_merged_segment_footprint` reservation downstream, which frees enough residency budget for a meaningful chunk-K. Small/quantized models are unaffected.
Quantization-aware: `total_params_bytes` is summed via `ggml_nbytes`, so a Q8/Q4 model is correctly identified as small relative to the budget.
Related Issue / Discussion
Continuation of the streaming-budget series #1576, #1598, #1601, #1611.
Additional Information
RTX 3060 12 GB, `--offload-to-cpu --stream-layers --max-vram -1`:
SDXL stays in the full-budget path, so the plan is unchanged (1 merged segment, whole UNet resident). Z-Image takes the T/4 path (a quarter of the streaming budget T): the planner produces 9-11 merged segments instead of 2, and chunk-K grows from ~0 to ~3.5 GB, dropping per-step H2D enough for a ~30% wallclock win.
We tested T/2, T/3, T/5, T/6, T/8 on the Z-Image workload; T/4 is the empirical optimum on the test hardware (smaller fractions trade chunk-K headroom for per-dispatch overhead and regress).
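The budget selection described in the Summary can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `capped_planner_budget` and the two named constants are hypothetical, and only the 0.75 threshold and the T/4 cap come from the description above.

```cpp
#include <cstdint>

// Hypothetical constants mirroring the values quoted in the PR description.
constexpr double   kLargeModelThreshold = 0.75;  // model "dwarfs" the budget above this fraction
constexpr uint64_t kBudgetDivisor       = 4;     // T/4, the empirical optimum reported above

// Hypothetical helper: returns the budget that would be handed to the planner.
// total_params_bytes is assumed to be the sum of ggml_nbytes over all tensors,
// so quantized models report their real (smaller) size.
inline uint64_t capped_planner_budget(uint64_t total_params_bytes,
                                      uint64_t effective_budget) {
    const bool model_dwarfs_budget =
        static_cast<double>(total_params_bytes) >
        kLargeModelThreshold * static_cast<double>(effective_budget);

    // Large model: plan with a quarter of the budget so merged segments stay
    // small and the worst-segment reservation leaves room for chunk-K.
    // Otherwise: keep the full budget (one big merged segment stays optimal).
    return model_dwarfs_budget ? effective_budget / kBudgetDivisor
                               : effective_budget;
}
```

For example, with an 11.7 GB model and an assumed 12 GB effective budget the sketch returns 3 GB, while a 2 GB model under the same budget keeps the full 12 GB. The real effective budget on a 12 GB card is likely lower than 12 GB, so these numbers are illustrative only.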