Skip to content

feat(spur): add burst-buffer staging subsystem with capacity pool and pending reasons#327

Merged
shiv-tyagi merged 4 commits into
ROCm:mainfrom
yansun1996:parity/burst-buffer
Jun 25, 2026
Merged

feat(spur): add burst-buffer staging subsystem with capacity pool and pending reasons#327
shiv-tyagi merged 4 commits into
ROCm:mainfrom
yansun1996:parity/burst-buffer

Conversation

@yansun1996

Copy link
Copy Markdown
Member

Draft — Group 4 of #307 / the scheduler+pool portion of #309 (burst-buffer staging subsystem).

⚠️ Stacked on #301 (parity/reason-code-vocab, unmerged). Until #301 lands, this diff includes #301's commits — review the feat(spur): add burst-buffer staging subsystem commit. Will rebase onto main once #301 merges.

What

Today --bb is only script-wrapping in the node agent — no pool, no scheduler awareness, and the BurstBuffer* reasons had no producer. This adds a real subsystem:

  • Capacity pool[burst_buffer] total_gb config; free capacity is derived (total − in-use), so it can never drift from config (mirrors how licenses work). A job holds capacity while Running/Suspended/Completing, or Pending-and-staging.
  • --bb capacity=NNN grammar (GB), parsed by a shared spur-core helper; coexists with the agent's stage_in:/stage_out: directives.
  • Per-job staging state machine None → Staging → Ready (on Job, rides the existing Raft snapshot — no new WAL op). Only Ready/no-BB jobs dispatch.
  • Two pending reasons, byte-exact Slurm strings, wired into BOTH pending_jobs() (drop) and tag_blocked_pending_reasons() (display) so they can't diverge:
    • BurstBufferResources — capacity shortage.
    • BurstBufferStageIn — capacity reserved, stage-in not yet complete.
  • Precedence: BB is last — Dependency → QoS → Reservation → Licenses → BurstBuffer — since staging happens immediately pre-dispatch.

Scope boundary

Real agent-side data movement is a documented follow-up behind drive_bb_stage_in() (the controller-side completion seam, called each scheduler cycle). The pool, both reasons, the staging state machine, and the scheduler hold are complete and tested; only the actual byte-copy round-trip is deferred.

Tests

Unit (spur-core): capacity parsing + pool alloc/free math. Integration (spurctld, 4 tests): over-capacity → BurstBufferResources; mid-stage → BurstBufferStageIn; dispatch only when Ready; capacity release on completion. Executor wrapping (spurd). REASON_VOCAB extended.

Live validation (Spur node)

  • Job requesting 200 GB against a 100 GB pool → PENDING Reason=BurstBufferResources
  • Job requesting 60 GB (within pool) → staged None→Staging→Ready and reached RUNNING ✅ (state machine end-to-end)
  • 2nd 60 GB job while the first holds 60 (only 40 free) → PENDING Reason=BurstBufferResources ✅ (derived capacity accounting)

Static: build/clippy/fmt clean; spur-core 207, spurctld 143, spurd 97 pass.

@codecov-commenter

codecov-commenter commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.18750% with 25 lines in your changes missing coverage. Please review.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #327      +/-   ##
==========================================
+ Coverage   66.90%   67.15%   +0.25%     
==========================================
  Files         127      128       +1     
  Lines       34212    34532     +320     
==========================================
+ Hits        22889    23189     +300     
- Misses      11323    11343      +20     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yansun1996 yansun1996 force-pushed the parity/burst-buffer branch from d7300fc to 92bab2e Compare June 24, 2026 22:39
@yansun1996 yansun1996 marked this pull request as ready for review June 24, 2026 22:39
Copilot AI review requested due to automatic review settings June 24, 2026 22:39
@yansun1996 yansun1996 requested a review from shiv-tyagi as a code owner June 24, 2026 22:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Comment thread crates/spurctld/src/scheduler_loop.rs Outdated
Comment thread crates/spurctld/src/cluster.rs
Comment thread crates/spurctld/src/cluster.rs

@shiv-tyagi shiv-tyagi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posted comments. PTAL.

Comment thread crates/spur-core/src/config.rs
Comment thread crates/spurctld/src/limits_cache.rs Outdated
yansun1996 added a commit to yansun1996/spur that referenced this pull request Jun 25, 2026
…v-tyagi

- config.rs: insert blank line between IsolationConfig and BurstBufferConfig
  doc blocks so rustdoc doesn't merge them into one comment, losing
  IsolationConfig docs and attaching isolation prose to BurstBufferConfig
- limits_cache.rs: rename test to test_proto_to_qos_parses_all_limits
  (was "five" but covered six fields), add grp_tres round-trip assertion
  so a regression in opt_tres handling for grp_tres would not pass silently

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
yansun1996 and others added 3 commits June 25, 2026 16:15
… pending reasons

Group 4 of ROCm#307 / closes the scheduler+pool portion of ROCm#309. Model
cluster-wide burst-buffer capacity ([burst_buffer] total_gb), parse
--bb capacity=NNN, and add a per-job None->Staging->Ready state machine
wired into the scheduler.

- spur-core: new burst_buffer module (BbStageState, parse_capacity_gb);
  BurstBufferResources/BurstBufferStageIn reasons + emitters + vocab;
  bb_stage_state on Job (rides the Raft job snapshot); BurstBufferConfig.
- spurctld: derived capacity pool (total - in-use, never drifts from config,
  mirrors licenses), advance_bb_staging (reserve highest-priority-first),
  complete_bb_stage_in, drive_bb_stage_in seam; both pending_jobs() drop and
  tag_blocked_pending_reasons() display gates; precedence is last
  (Dep->QoS->Resv->Lic->BB) since staging is immediately pre-dispatch.
- spurd: --bb capacity= coexists with stage_in/stage_out wrapping.

Real agent-side data movement is a follow-up behind drive_bb_stage_in();
the state machine, pool, both reasons, and scheduler hold are complete.

Tests: capacity parsing + pool math (spur-core); 4 spurctld integration
tests (resource shortage -> BurstBufferResources, mid-stage ->
BurstBufferStageIn, dispatch only when Ready); executor wrapping (spurd);
REASON_VOCAB extended.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
- Remove the redundant collect-into-pairs-then-strip-priority pattern in
  advance_bb_staging candidates collection; collect job IDs directly and
  let the sort closure fetch priority from the already-held write guard.
- Drop what-narrating inline comments from the BB gate match arms and
  scheduler-loop call site; keep only the non-obvious why (double-count
  guard, drive-before-advance ordering, follow-up seam).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Reorder scheduler loop: run drive_bb_stage_in()/advance_bb_staging()
before tag_blocked_pending_reasons() so BurstBufferStageIn reasons
reflect the staging state set in the same cycle, not the previous one.

Exclude deadlined jobs from advance_bb_staging() candidates to avoid
clobbering PendingReason::DeadLine with BurstBufferStageIn.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
yansun1996 added a commit to yansun1996/spur that referenced this pull request Jun 25, 2026
…v-tyagi

- config.rs: insert blank line between IsolationConfig and BurstBufferConfig
  doc blocks so rustdoc doesn't merge them into one comment, losing
  IsolationConfig docs and attaching isolation prose to BurstBufferConfig
- limits_cache.rs: rename test to test_proto_to_qos_parses_all_limits
  (was "five" but covered six fields), add grp_tres round-trip assertion
  so a regression in opt_tres handling for grp_tres would not pass silently

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@yansun1996 yansun1996 force-pushed the parity/burst-buffer branch from 3afbcd2 to bc1e464 Compare June 25, 2026 16:17
@yansun1996 yansun1996 requested a review from shiv-tyagi June 25, 2026 16:18
…v-tyagi

- config.rs: insert blank line between IsolationConfig and BurstBufferConfig
  doc blocks so rustdoc doesn't merge them into one comment, losing
  IsolationConfig docs and attaching isolation prose to BurstBufferConfig
- limits_cache.rs: rename test to test_proto_to_qos_parses_all_limits
  (was "five" but covered six fields), add grp_tres round-trip assertion
  so a regression in opt_tres handling for grp_tres would not pass silently

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@yansun1996 yansun1996 force-pushed the parity/burst-buffer branch from bc1e464 to a2070e0 Compare June 25, 2026 16:22
@shiv-tyagi shiv-tyagi merged commit fca5df6 into ROCm:main Jun 25, 2026
14 checks passed
@yansun1996 yansun1996 deleted the parity/burst-buffer branch June 25, 2026 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants