feat(spur): add burst-buffer staging subsystem with capacity pool and pending reasons by yansun1996 · Pull Request #327 · ROCm/spur

yansun1996 · 2026-06-23T00:12:20Z

Draft — Group 4 of #307 / the scheduler+pool portion of #309 (burst-buffer staging subsystem).

⚠️ Stacked on #301 (parity/reason-code-vocab, unmerged). Until #301 lands, this diff includes #301's commits — review the feat(spur): add burst-buffer staging subsystem commit. Will rebase onto main once #301 merges.

What

Today --bb is only script-wrapping in the node agent — no pool, no scheduler awareness, and the BurstBuffer* reasons had no producer. This adds a real subsystem:

Capacity pool — [burst_buffer] total_gb config; free capacity is derived (total − in-use), so it can never drift from config (mirrors how licenses work). A job holds capacity while Running/Suspended/Completing, or Pending-and-staging.
--bb capacity=NNN grammar (GB), parsed by a shared spur-core helper; coexists with the agent's stage_in:/stage_out: directives.
Per-job staging state machine None → Staging → Ready (on Job, rides the existing Raft snapshot — no new WAL op). Only Ready/no-BB jobs dispatch.
Two pending reasons, byte-exact Slurm strings, wired into BOTH pending_jobs() (drop) and tag_blocked_pending_reasons() (display) so they can't diverge:
- BurstBufferResources — capacity shortage.
- BurstBufferStageIn — capacity reserved, stage-in not yet complete.
Precedence: BB is last — Dependency → QoS → Reservation → Licenses → BurstBuffer — since staging happens immediately pre-dispatch.

Scope boundary

Real agent-side data movement is a documented follow-up behind drive_bb_stage_in() (the controller-side completion seam, called each scheduler cycle). The pool, both reasons, the staging state machine, and the scheduler hold are complete and tested; only the actual byte-copy round-trip is deferred.

Tests

Unit (spur-core): capacity parsing + pool alloc/free math. Integration (spurctld, 4 tests): over-capacity → BurstBufferResources; mid-stage → BurstBufferStageIn; dispatch only when Ready; capacity release on completion. Executor wrapping (spurd). REASON_VOCAB extended.

Live validation (Spur node)

Job requesting 200 GB against a 100 GB pool → PENDING Reason=BurstBufferResources ✅
Job requesting 60 GB (within pool) → staged None→Staging→Ready and reached RUNNING ✅ (state machine end-to-end)
2nd 60 GB job while the first holds 60 (only 40 free) → PENDING Reason=BurstBufferResources ✅ (derived capacity accounting)

Static: build/clippy/fmt clean; spur-core 207, spurctld 143, spurd 97 pass.

codecov-commenter · 2026-06-23T00:16:22Z

Codecov Report

❌ Patch coverage is 92.18750% with 25 lines in your changes missing coverage. Please review.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #327      +/-   ##
==========================================
+ Coverage   66.90%   67.15%   +0.25%     
==========================================
  Files         127      128       +1     
  Lines       34212    34532     +320     
==========================================
+ Hits        22889    23189     +300     
- Misses      11323    11343      +20

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

shiv-tyagi

Posted comments. PTAL.

…v-tyagi - config.rs: insert blank line between IsolationConfig and BurstBufferConfig doc blocks so rustdoc doesn't merge them into one comment, losing IsolationConfig docs and attaching isolation prose to BurstBufferConfig - limits_cache.rs: rename test to test_proto_to_qos_parses_all_limits (was "five" but covered six fields), add grp_tres round-trip assertion so a regression in opt_tres handling for grp_tres would not pass silently Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

… pending reasons Group 4 of ROCm#307 / closes the scheduler+pool portion of ROCm#309. Model cluster-wide burst-buffer capacity ([burst_buffer] total_gb), parse --bb capacity=NNN, and add a per-job None->Staging->Ready state machine wired into the scheduler. - spur-core: new burst_buffer module (BbStageState, parse_capacity_gb); BurstBufferResources/BurstBufferStageIn reasons + emitters + vocab; bb_stage_state on Job (rides the Raft job snapshot); BurstBufferConfig. - spurctld: derived capacity pool (total - in-use, never drifts from config, mirrors licenses), advance_bb_staging (reserve highest-priority-first), complete_bb_stage_in, drive_bb_stage_in seam; both pending_jobs() drop and tag_blocked_pending_reasons() display gates; precedence is last (Dep->QoS->Resv->Lic->BB) since staging is immediately pre-dispatch. - spurd: --bb capacity= coexists with stage_in/stage_out wrapping. Real agent-side data movement is a follow-up behind drive_bb_stage_in(); the state machine, pool, both reasons, and scheduler hold are complete. Tests: capacity parsing + pool math (spur-core); 4 spurctld integration tests (resource shortage -> BurstBufferResources, mid-stage -> BurstBufferStageIn, dispatch only when Ready); executor wrapping (spurd); REASON_VOCAB extended. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

- Remove the redundant collect-into-pairs-then-strip-priority pattern in advance_bb_staging candidates collection; collect job IDs directly and let the sort closure fetch priority from the already-held write guard. - Drop what-narrating inline comments from the BB gate match arms and scheduler-loop call site; keep only the non-obvious why (double-count guard, drive-before-advance ordering, follow-up seam). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Reorder scheduler loop: run drive_bb_stage_in()/advance_bb_staging() before tag_blocked_pending_reasons() so BurstBufferStageIn reasons reflect the staging state set in the same cycle, not the previous one. Exclude deadlined jobs from advance_bb_staging() candidates to avoid clobbering PendingReason::DeadLine with BurstBufferStageIn. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

…v-tyagi - config.rs: insert blank line between IsolationConfig and BurstBufferConfig doc blocks so rustdoc doesn't merge them into one comment, losing IsolationConfig docs and attaching isolation prose to BurstBufferConfig - limits_cache.rs: rename test to test_proto_to_qos_parses_all_limits (was "five" but covered six fields), add grp_tres round-trip assertion so a regression in opt_tres handling for grp_tres would not pass silently Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

yansun1996 force-pushed the parity/burst-buffer branch from d7300fc to 92bab2e Compare June 24, 2026 22:39

yansun1996 marked this pull request as ready for review June 24, 2026 22:39

yansun1996 requested review from powderluv and sgopinath1 as code owners June 24, 2026 22:39

Copilot AI review requested due to automatic review settings June 24, 2026 22:39

yansun1996 requested a review from shiv-tyagi as a code owner June 24, 2026 22:39

Copilot started reviewing on behalf of yansun1996 June 24, 2026 22:40 View session

Copilot AI reviewed Jun 24, 2026

yansun1996 force-pushed the parity/burst-buffer branch from 92bab2e to 672e53f Compare June 25, 2026 00:11

yansun1996 requested a review from Copilot June 25, 2026 00:12

Copilot started reviewing on behalf of yansun1996 June 25, 2026 00:13 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread crates/spurctld/src/scheduler_loop.rs Outdated

Comment thread crates/spurctld/src/cluster.rs

Comment thread crates/spurctld/src/cluster.rs

shiv-tyagi reviewed Jun 25, 2026

View reviewed changes

Comment thread crates/spur-core/src/config.rs

Comment thread crates/spurctld/src/limits_cache.rs Outdated

yansun1996 and others added 3 commits June 25, 2026 16:15

yansun1996 force-pushed the parity/burst-buffer branch from 3afbcd2 to bc1e464 Compare June 25, 2026 16:17

yansun1996 requested a review from shiv-tyagi June 25, 2026 16:18

yansun1996 force-pushed the parity/burst-buffer branch from bc1e464 to a2070e0 Compare June 25, 2026 16:22

shiv-tyagi approved these changes Jun 25, 2026

View reviewed changes

shiv-tyagi merged commit fca5df6 into ROCm:main Jun 25, 2026
14 checks passed

yansun1996 deleted the parity/burst-buffer branch June 25, 2026 17:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(spur): add burst-buffer staging subsystem with capacity pool and pending reasons#327

feat(spur): add burst-buffer staging subsystem with capacity pool and pending reasons#327
shiv-tyagi merged 4 commits into
ROCm:mainfrom
yansun1996:parity/burst-buffer

yansun1996 commented Jun 23, 2026

Uh oh!

codecov-commenter commented Jun 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shiv-tyagi left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

yansun1996 commented Jun 23, 2026

What

Scope boundary

Tests

Live validation (Spur node)

Uh oh!

codecov-commenter commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shiv-tyagi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jun 23, 2026 •

edited

Loading