Skip to content

[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074

Merged
sudhakarsingh27 merged 3 commits into
NVIDIA:mainfrom
sudhakarsingh27:sudhakars/cp_pool_stdout_isolation
Jun 4, 2026
Merged

[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074
sudhakarsingh27 merged 3 commits into
NVIDIA:mainfrom
sudhakarsingh27:sudhakars/cp_pool_stdout_isolation

Conversation

@sudhakarsingh27
Copy link
Copy Markdown
Member

@sudhakarsingh27 sudhakarsingh27 commented Jun 2, 2026

Problem

The CP attention test pool worker (added in #2993) uses rank-0 stdout as a line-delimited JSON protocol channel between the worker and the parent pytest process. PoolWorker._submit_once reads one line and json.loads it.

When NCCL_DEBUG is set to VERSION or INFO, NCCL writes a banner line — NCCL version 2.x.y+cudaA.B\n — to stdout (fd 1), not stderr. That banner reaches the parent ahead of the first JSON response, json.loads raises, and because the cp_pool fixture is session-scoped and the pool is killed on first failure, every subsequent CP test in the file fails with:

AssertionError: pool worker JSON protocol broke:
JSONDecodeError('Expecting value: line 1 column 1 (char 0)');
line='NCCL version 2.30.5+cuda13.3\n'

CI runners that export NCCL_DEBUG=VERSION (e.g. the cuDNN nightly GB200 job) hit this on all ~200 non-skipped test_cp_with_flash_attention / test_cp_with_fused_attention cases. With NCCL_DEBUG unset the banner isn't emitted, so the suite passes — which is why the regression only shows up on those runners.

The existing mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream.

Fix

Worker — run_attention_with_cp_pool.py: before import torch, dup() rank-0's original fd 1 into a private stream (_JSON_STDOUT) reserved for the JSON protocol, then dup2(2, 1) so fd 1 points at stderr. Any banner from NCCL / cuBLAS / cuDNN / torch — Python-level or C-level, import-time or runtime — now lands on stderr (still drained into CI logs by the parent's stderr thread) instead of the protocol pipe. Responses are written through _JSON_STDOUT. The redirect runs at module scope, before import torch, so even import-time/C-level writes are covered.

The pre-existing _silence_non_rank0_stdout (rank>0 → /dev/null) is kept, so between the two redirects every rank's fd 1 is diverted away from the pipe. The asymmetry is intentional and documented in the code: rank 0 keeps a dup of fd 1 for the JSON channel and sends the rest to stderr (preserved in CI logs); rank>0 has nothing to preserve, so it goes to /dev/null.

Parent — test_attention_with_cp.py: comment-only update. The read path is unchanged (single select + readline + json.loads), since with both redirects in place the pipe carries only rank-0's JSON line. A non-JSON line is still turned into a clear AssertionError ("pool worker JSON protocol broke") rather than a raw JSONDecodeError.

Why fd-level isolation rather than a stdout sentinel/skip

An earlier revision prefixed responses with a [CP_POOL_RESP] sentinel and had the parent skip non-protocol lines. That was dropped after measuring it: with the rank-0 fd redirect in place, instrumenting the parent to count non-sentinel lines on the pipe under NCCL_DEBUG=INFO (max verbosity) across multiple cases yields zero — the torchrun agent logs to stderr, not this stdout pipe, so the sentinel filtered nothing. fd isolation also behaves better under NCCL_DEBUG=INFO: the (large) banner volume goes to stderr instead of forcing the parent to read+skip hundreds of lines within the response deadline. Net result is a smaller, single-mechanism fix.

Validation

8×H100, TE built from this commit, with the NCCL banner confirmed emitted in every run (grep "NCCL version"):

Scenario Result
NCCL_DEBUG=VERSION, flash p2p + all_gather + a2a (one session, pool reused across the banner) ✅ 3 passed
NCCL_DEBUG=INFO (verbose), flash p2p + fused p2p ✅ 2 passed
NCCL_DEBUG=VERSION, fused p2p ✅ 1 passed
NCCL_DEBUG unset (control) ✅ 1 passed
L1 det/non-det parallel suite (qa/L1_pytorch_distributed_unittest/test.sh pattern, NCCL_DEBUG=VERSION) ✅ non-det 45 passed / det 33 passed, 0 protocol breaks
Without this fix, NCCL_DEBUG=VERSION pool worker JSON protocol broke (regression reproduced)

Type of change

  • Bug fix (test infrastructure; no library/runtime code change)

Checklist

  • Contributing guidelines followed
  • Changes commented where non-obvious
  • No new warnings
  • Existing test suite serves as input + validation
  • Existing tests pass locally (8×H100, NCCL_DEBUG=VERSION/INFO/unset; L1 det+non-det)

@sudhakarsingh27 sudhakarsingh27 self-assigned this Jun 2, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR fixes a CI regression where NCCL_DEBUG=VERSION/INFO caused NCCL banner lines to corrupt the line-delimited JSON protocol channel between the CP pool worker (rank-0 stdout) and the parent pytest process. The fix is a two-layer defense applied before importing torch: an fd-level dup of the original stdout is reserved exclusively for JSON (_JSON_STDOUT), and fd 1 is redirected to stderr so all C-level and Python-level library writes during import or runtime land on stderr instead of the protocol pipe.

  • Worker (run_attention_with_cp_pool.py): Adds _redirect_rank0_stdout_to_stderr() called at module level before import torch; _send_response now writes through _JSON_STDOUT (the private dup) instead of sys.stdout.
  • Parent (test_attention_with_cp.py): Updates comments only — logic is unchanged since the earlier sentinel-based approach was already removed, leaving the parent to rely entirely on the fd-level isolation.

Confidence Score: 5/5

Safe to merge — test infrastructure only, no library or runtime code changed.

The fd-level isolation is correctly placed before import torch so even import-time C-level writes are captured. The _JSON_STDOUT dup is only used on rank 0, eliminating any null-dereference risk on other ranks. The parent-side changes are comment-only.

No files require special attention.

Important Files Changed

Filename Overview
tests/pytorch/attention/run_attention_with_cp_pool.py Adds fd-level stdout isolation for rank-0 before torch import; _send_response correctly routes through the preserved _JSON_STDOUT dup. Logic is sound and all rank/null guards are in place.
tests/pytorch/attention/test_attention_with_cp.py Comment-only updates to _submit_once; no logic changes. Accurately reflects that isolation is now done at the fd level on the worker side.

Sequence Diagram

sequenceDiagram
    participant Parent as pytest (parent)
    participant Pipe as stdout pipe (fd 1 original)
    participant Stderr as stderr
    participant Worker as torchrun worker (rank-0)

    Note over Worker: Module load BEFORE import torch
    Worker->>Pipe: os.dup(1) _JSON_STDOUT (private fd)
    Worker->>Stderr: os.dup2(2, 1) fd 1 now points at stderr
    Worker->>Stderr: "sys.stdout = sys.stderr"

    Note over Worker: import torch (NCCL init, banners)
    Worker->>Stderr: NCCL banner via fd 1 to stderr

    Note over Worker: main() dispatch loop
    Parent->>Worker: JSON request via stdin
    Worker->>Pipe: _JSON_STDOUT.write(json.dumps(payload))
    Parent->>Pipe: select() + readline()
    Parent->>Parent: json.loads(line) clean JSON
Loading

Reviews (4): Last reviewed commit: "Merge branch 'main' into sudhakars/cp_po..." | Re-trigger Greptile

Comment on lines 80 to 83
_RESP_PREFIX = "[CP_POOL_RESP] "


def _send_response(rank: int, payload: dict) -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _RESP_PREFIX duplicated across worker and parent without a shared source

The sentinel string "[CP_POOL_RESP] " is defined independently in both run_attention_with_cp_pool.py and test_attention_with_cp.py. If they ever drift (e.g. a typo during a rename), the parent will silently spin through every line until it times out, emitting a confusing "timed out" failure rather than a protocol-mismatch error. A brief comment calling out the coupling would make this risk explicit.

Suggested change
_RESP_PREFIX = "[CP_POOL_RESP] "
def _send_response(rank: int, payload: dict) -> None:
# NOTE: this string must stay identical to PoolWorker._RESP_PREFIX in
# test_attention_with_cp.py; the parent uses it to identify response lines.
_RESP_PREFIX = "[CP_POOL_RESP] "
def _send_response(rank: int, payload: dict) -> None:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by removing the sentinel entirely in e543ef89 — the rank-0 fd redirect keeps the pipe clean on its own (verified: 0 non-protocol lines under NCCL_DEBUG=INFO), so there's no longer a duplicated _RESP_PREFIX to keep in sync across the two files.

Copy link
Copy Markdown
Member Author

@sudhakarsingh27 sudhakarsingh27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Targeted fix, mostly minimal. Main asks: drop the unreachable _JSON_STDOUT is None raise in _send_response and the dead re-entry guard in _redirect_rank0_stdout_to_stderr. The fd-redirect + sentinel pair is justified (worker process vs torchrun agent process — different coverage), but confirm the agent actually writes to this pipe; if that's speculative the sentinel alone suffices. fd-level ops are correct and the import-time placement is right. Comments are on the verbose side but explain real non-obvious intent, so fine to keep.

Comment thread tests/pytorch/attention/run_attention_with_cp_pool.py Outdated
Comment thread tests/pytorch/attention/run_attention_with_cp_pool.py Outdated
Comment thread tests/pytorch/attention/run_attention_with_cp_pool.py
Comment thread tests/pytorch/attention/run_attention_with_cp_pool.py
Comment thread tests/pytorch/attention/test_attention_with_cp.py Outdated
The CP attention test pool worker (PR NVIDIA#2993) uses rank-0 stdout as a
line-delimited JSON protocol channel to the parent pytest process. When
NCCL_DEBUG is set to VERSION or INFO, NCCL writes an "NCCL version ...\n"
banner to stdout (fd 1); that banner reaches the parent ahead of the
first JSON response, json.loads raises, and because the pool fixture is
session-scoped and killed on first failure, every subsequent CP test in
the file fails. CI runners that export NCCL_DEBUG hit this on all ~200
non-skipped cases.

The prior mitigation only redirected non-rank-0 stdout to /dev/null, so
rank 0's own banner still corrupted the stream. Fix it at the source:
in the worker, before importing torch, dup rank-0's original fd 1 into a
private stream reserved for JSON, then point fd 1 at stderr. Any banner
from NCCL/cuBLAS/cuDNN/torch (Python or C level, import-time or runtime)
now lands on stderr — still drained into CI logs by the parent's stderr
thread — instead of the protocol pipe. Combined with the existing
non-rank-0 /dev/null redirect, the pipe carries only rank-0 JSON, so the
parent's single-line read needs no change.

Validated on 8xH100 (TE built from this commit). With NCCL_DEBUG=VERSION
and =INFO, flash (p2p/all_gather/a2a) and fused cases pass and zero
non-protocol lines reach the pipe; without the fix the same cases fail
with "pool worker JSON protocol broke". Control with NCCL_DEBUG unset
also passes.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 force-pushed the sudhakars/cp_pool_stdout_isolation branch from 8332034 to e543ef8 Compare June 2, 2026 00:23
@sudhakarsingh27
Copy link
Copy Markdown
Member Author

/te-ci pytorch L3

Copy link
Copy Markdown
Collaborator

@cyanguwa cyanguwa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure you test with NCCL_DEBUG turned on. Thanks for the PR!

@sudhakarsingh27 sudhakarsingh27 merged commit 0e58073 into NVIDIA:main Jun 4, 2026
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants