feat(e2e): add native-host scheduler stress harness#314

Open
sgopinath1 wants to merge 3 commits into
ROCm:mainfrom
sgopinath1:stress_tests

@sgopinath1

Collaborator

Summary

  • Adds tests/e2e/stress_harness/: native-host stress helpers that submit many single-CPU batch jobs, wait for the default squeue view to drain, then sample scontrol show job for queue wait / run / turnaround percentiles.
  • Submit and latency work run on the controller via uploaded bash drivers and xargs -P, so the test runner uses a small number of Paramiko exec calls instead of one SSH channel per sbatch / scontrol.
  • New pytest entry tests/e2e/native_host/test_scheduler_stress.py with @pytest.mark.stress; documents env vars (SPUR_STRESS_TIERS, SPUR_STRESS_PARALLEL, etc.) in tests/e2e/stress_harness/README.md and docs/developer/building.rst.
  • Bare-metal E2E workflow runs pytest …/native_host/ -m "not stress" so default CI stays unchanged; stress is opt-in via -m stress.
  • spurd: completion log line includes the controller URL for easier correlation when debugging multi-controller setups.
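The latency-sampling step in the first item can be illustrated with a minimal sketch. It assumes `scontrol show job` prints Slurm-style `Key=Value` fields named `SubmitTime`, `StartTime`, and `EndTime` in `YYYY-MM-DDTHH:MM:SS` form, and it uses a nearest-rank percentile; the function names are hypothetical and are not taken from `harness.py`.

```python
import math
import re
from datetime import datetime

# Assumption: Slurm-style "Key=Value" tokens separated by whitespace.
_FIELD = re.compile(r"(\w+)=(\S+)")
_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_job_times(scontrol_text):
    """Extract submit/start/end timestamps from one job record."""
    fields = dict(_FIELD.findall(scontrol_text))
    return {
        key: datetime.strptime(fields[key], _TIME_FORMAT)
        for key in ("SubmitTime", "StartTime", "EndTime")
    }


def job_latencies(times):
    """Return queue wait, run, and turnaround durations in seconds."""
    return {
        "queue_wait": (times["StartTime"] - times["SubmitTime"]).total_seconds(),
        "run": (times["EndTime"] - times["StartTime"]).total_seconds(),
        "turnaround": (times["EndTime"] - times["SubmitTime"]).total_seconds(),
    }


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]
```

A record whose timestamps are missing or non-numeric (for example a job that never started) would raise `KeyError` or `ValueError` here, so a caller would likely want to skip or count such jobs separately.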

Copilot AI review requested due to automatic review settings June 17, 2026 12:30
@codecov-commenter

codecov-commenter commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #314      +/-   ##
==========================================
- Coverage   65.11%   65.11%   -0.01%     
==========================================
  Files         121      121              
  Lines       32346    32346              
==========================================
- Hits        21062    21059       -3     
- Misses      11284    11287       +3     

Copilot AI left a comment

Contributor

Pull request overview

Adds an opt-in native-host scheduler stress E2E harness to generate load via controller-side parallel sbatch, wait for the queue to drain, then compute latency percentiles from scontrol show job timestamps—while keeping default CI unchanged.
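The "wait for the queue to drain" step can be sketched as a bounded polling loop. This is only an illustration: `count_jobs` is a hypothetical callable that returns how many jobs the default `squeue` view still lists, and the timeout and poll interval are placeholder values rather than the harness defaults.

```python
import time


def wait_for_drain(count_jobs, timeout_s=600.0, poll_s=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until count_jobs() reports zero jobs or the timeout expires.

    Returns True if the queue drained, False on timeout. The sleep and
    clock callables are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if count_jobs() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Checking the queue before checking the deadline means a queue that is already empty returns immediately, even with a zero timeout.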

Changes:

  • Introduces tests/e2e/stress_harness/ (Python + uploaded bash drivers) and a new @pytest.mark.stress test tests/e2e/native_host/test_scheduler_stress.py.
  • Documents stress usage and env vars in tests/e2e/stress_harness/README.md and docs/developer/building.rst; registers the new pytest marker.
  • Updates bare-metal E2E workflow to exclude stress by default; enhances spurd completion logging with controller URL.
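To illustrate why controller-side drivers keep the SSH channel count low, here is a minimal sketch of building one shell command that fans out with `xargs -P`, so the test runner needs a single remote exec for the whole batch. The helper name, the job script path, and the exact command shape are assumptions for illustration, not the contents of the uploaded bash drivers.

```python
import shlex


def build_fanout_command(job_count, parallel, per_item_argv):
    """Build one shell command that runs per_item_argv job_count times,
    with at most `parallel` invocations in flight at once.

    xargs -I{} runs the command once per input line; any literal "{}"
    in per_item_argv is replaced by the line (here, the job index).
    """
    if job_count < 1 or parallel < 1:
        raise ValueError("job_count and parallel must be positive")
    worker = " ".join(shlex.quote(arg) for arg in per_item_argv)
    return f"seq 1 {job_count} | xargs -P {parallel} -I{{}} {worker}"


# Hypothetical usage: /tmp/stress_job.sh is a placeholder script path.
command = build_fanout_command(100, 8, ["sbatch", "/tmp/stress_job.sh"])
```

The resulting string would be passed to a single remote exec call, in place of 100 separate ones.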

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/e2e/stress_harness/README.md | Documents how the stress harness runs and its environment variables. |
| tests/e2e/stress_harness/harness.py | Implements tier execution, remote submit/latency drivers, and report formatting. |
| tests/e2e/stress_harness/__init__.py | Exposes harness helpers as a small library module. |
| tests/e2e/README.md | Adds stress harness mention and invocation example. |
| tests/e2e/pytest.ini | Adds stress marker metadata. |
| tests/e2e/native_host/test_scheduler_stress.py | New opt-in stress test driving the harness and printing a summary report. |
| docs/developer/building.rst | Adds developer docs for running the stress test and tuning env vars. |
| crates/spurd/src/agent_server.rs | Adds controller URL to completion-report log line. |
| .github/workflows/e2e.yml | Excludes @pytest.mark.stress from default bare-metal E2E workflow. |

Comment thread tests/e2e/stress_harness/harness.py
Comment thread tests/e2e/stress_harness/harness.py Outdated
Comment thread tests/e2e/native_host/test_scheduler_stress.py
Comment thread tests/e2e/native_host/test_scheduler_stress.py
Add stress_harness + pytest stress marker: remote xargs drivers on the
controller avoid Paramiko sbatch/scontrol fan-out; drain uses default squeue;
emit a rocm-style markdown summary to stdout.
Align printed harness location with PR ROCm#316 layout (native_host/e2e).
Comment on lines +342 to +347
info!(
    job_id,
    exit_code,
    controller = %url,
    "reported completion to controller"
);

Member

This is out of this PR's scope.

@shiv-tyagi

Member

Maybe we can also call these performance tests instead of just stress tests.
