feat(e2e): add native-host scheduler stress harness #314
Open
sgopinath1 wants to merge 3 commits into
Codecov Report
Additional details and impacted files

| Coverage Diff | main | #314 | +/- |
|---|---|---|---|
| Coverage | 65.11% | 65.11% | -0.01% |
| Files | 121 | 121 | |
| Lines | 32346 | 32346 | |
| Hits | 21062 | 21059 | -3 |
| Misses | 11284 | 11287 | +3 |
Pull request overview
Adds an opt-in native-host scheduler stress E2E harness that generates load via controller-side parallel `sbatch`, waits for the queue to drain, then computes latency percentiles from `scontrol show job` timestamps, while keeping default CI unchanged.
Changes:
- Introduces `tests/e2e/stress_harness/` (Python + uploaded bash drivers) and a new `@pytest.mark.stress` test `tests/e2e/native_host/test_scheduler_stress.py`.
- Documents stress usage and env vars in `tests/e2e/stress_harness/README.md` and `docs/developer/building.rst`; registers the new pytest marker.
- Updates bare-metal E2E workflow to exclude `stress` by default; enhances `spurd` completion logging with controller URL.
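The latency step described in the overview can be sketched roughly as follows. This is an illustration only, not the harness's actual code: it assumes Slurm-style `scontrol show job` output with `SubmitTime=`, `StartTime=`, and `EndTime=` fields in `YYYY-MM-DDTHH:MM:SS` form, and the names `parse_job_times` and `percentile` are hypothetical.

```python
import math
import re
from datetime import datetime

# Assumed Slurm-style timestamp fields; the real harness may parse differently.
_FIELD = re.compile(r"\b(SubmitTime|StartTime|EndTime)=(\S+)")
_FMT = "%Y-%m-%dT%H:%M:%S"


def parse_job_times(scontrol_text):
    """Return (queue_wait, run, turnaround) in seconds for one job record."""
    fields = dict(_FIELD.findall(scontrol_text))
    submit = datetime.strptime(fields["SubmitTime"], _FMT)
    start = datetime.strptime(fields["StartTime"], _FMT)
    end = datetime.strptime(fields["EndTime"], _FMT)
    return (
        (start - submit).total_seconds(),
        (end - start).total_seconds(),
        (end - submit).total_seconds(),
    )


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]
```

Nearest-rank is used here because it always returns an observed sample; an interpolating percentile would be an equally reasonable choice for a report.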
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/e2e/stress_harness/README.md | Documents how the stress harness runs and its environment variables. |
| tests/e2e/stress_harness/harness.py | Implements tier execution, remote submit/latency drivers, and report formatting. |
| tests/e2e/stress_harness/__init__.py | Exposes harness helpers as a small library module. |
| tests/e2e/README.md | Adds stress harness mention and invocation example. |
| tests/e2e/pytest.ini | Adds stress marker metadata. |
| tests/e2e/native_host/test_scheduler_stress.py | New opt-in stress test driving the harness and printing a summary report. |
| docs/developer/building.rst | Adds developer docs for running the stress test and tuning env vars. |
| crates/spurd/src/agent_server.rs | Adds controller URL to completion-report log line. |
| .github/workflows/e2e.yml | Excludes @pytest.mark.stress from default bare-metal E2E workflow. |
Add stress_harness + pytest stress marker: remote xargs drivers on the controller avoid Paramiko sbatch/scontrol fan-out; drain uses default squeue; emit a rocm-style markdown summary to stdout.
Align printed harness location with PR ROCm#316 layout (native_host/e2e).
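The drain step mentioned in the commit message could look something like the sketch below. It is a guess at the shape, not the harness's code: `run_squeue` is a hypothetical callable that runs `squeue` on the controller and returns its stdout, and the output is assumed to be headerless with one line per job (as with Slurm's `squeue -h`).

```python
import time


def count_squeue_jobs(squeue_output):
    """Count jobs in headerless squeue output (one non-empty line per job)."""
    return sum(1 for line in squeue_output.splitlines() if line.strip())


def wait_for_drain(run_squeue, timeout_s=600.0, poll_s=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the queue is empty; return True if drained before timeout.

    run_squeue is a hypothetical callable that executes squeue on the
    controller (for example over one SSH exec) and returns its stdout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_squeue_jobs(run_squeue()) == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.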
shiv-tyagi requested changes Jun 20, 2026
Comment on lines +342 to +347

    info!(
        job_id,
        exit_code,
        controller = %url,
        "reported completion to controller"
    );
This is out of this PR's scope.
Maybe we can also call these performance tests instead of just stress tests.
Summary
- `tests/e2e/stress_harness/`: native-host stress helpers that submit many single-CPU batch jobs, wait for the default `squeue` view to drain, then sample `scontrol show job` for queue wait / run / turnaround percentiles.
- Submission and sampling run on the controller through uploaded bash drivers that use `xargs -P`, so the test runner uses a small number of Paramiko `exec` calls instead of one SSH channel per `sbatch`/`scontrol`.
- Adds `tests/e2e/native_host/test_scheduler_stress.py` with `@pytest.mark.stress`; documents env vars (`SPUR_STRESS_TIERS`, `SPUR_STRESS_PARALLEL`, etc.) in `tests/e2e/stress_harness/README.md` and `docs/developer/building.rst`.
- The bare-metal E2E workflow runs `pytest …/native_host/ -m "not stress"` so default CI stays unchanged; stress is opt-in via `-m stress`.
- `spurd`: completion log line includes the controller URL for easier correlation when debugging multi-controller setups.
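To make the fan-out idea above concrete, here is a minimal sketch of how a test runner might build a single controller-side command instead of opening one SSH channel per job. Everything here is an assumption for illustration: the helper names `parse_tiers` and `build_submit_command` are hypothetical, the comma-separated format for `SPUR_STRESS_TIERS` is a guess, and the real drivers are uploaded bash scripts whose contents are not shown in this PR page.

```python
import shlex


def parse_tiers(raw):
    """Parse a comma-separated list of job counts, e.g. "100,500".

    The format is an assumption; check the harness README for the real one.
    """
    return [int(part) for part in raw.split(",") if part.strip()]


def build_submit_command(job_count, parallel, script_path):
    """Build one shell command that submits job_count jobs with xargs -P."""
    if job_count < 1 or parallel < 1:
        raise ValueError("job_count and parallel must be positive")
    # seq emits one line per job; -I{} makes xargs run sbatch once per line,
    # and -P bounds how many sbatch processes run concurrently.
    return (
        f"seq {int(job_count)} | "
        f"xargs -P {int(parallel)} -I{{}} sbatch {shlex.quote(script_path)}"
    )
```

The resulting string would be sent through one remote exec call, so the number of SSH channels stays constant regardless of how many jobs a tier submits.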