[PyTorch] Isolate CP pool worker stdout from NCCL/library banners by sudhakarsingh27 · Pull Request #3074 · NVIDIA/TransformerEngine

sudhakarsingh27 · 2026-06-02T00:09:29Z

Problem

The CP attention test pool worker (added in #2993) uses rank-0 stdout as a line-delimited JSON protocol channel between the worker and the parent pytest process. PoolWorker._submit_once reads one line and json.loads it.

When NCCL_DEBUG is set to VERSION or INFO, NCCL writes a banner line — NCCL version 2.x.y+cudaA.B\n — to stdout (fd 1), not stderr. That banner reaches the parent ahead of the first JSON response, json.loads raises, and because the cp_pool fixture is session-scoped and the pool is killed on first failure, every subsequent CP test in the file fails with:

AssertionError: pool worker JSON protocol broke:
JSONDecodeError('Expecting value: line 1 column 1 (char 0)');
line='NCCL version 2.30.5+cuda13.3\n'

CI runners that export NCCL_DEBUG=VERSION (e.g. the cuDNN nightly GB200 job) hit this on all ~200 non-skipped test_cp_with_flash_attention / test_cp_with_fused_attention cases. With NCCL_DEBUG unset the banner isn't emitted, so the suite passes — which is why the regression only shows up on those runners.

The existing mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream.

Fix

Worker — run_attention_with_cp_pool.py: before import torch, dup() rank-0's original fd 1 into a private stream (_JSON_STDOUT) reserved for the JSON protocol, then dup2(2, 1) so fd 1 points at stderr. Any banner from NCCL / cuBLAS / cuDNN / torch — Python-level or C-level, import-time or runtime — now lands on stderr (still drained into CI logs by the parent's stderr thread) instead of the protocol pipe. Responses are written through _JSON_STDOUT. The redirect runs at module scope, before import torch, so even import-time/C-level writes are covered.

The pre-existing _silence_non_rank0_stdout (rank>0 → /dev/null) is kept, so between the two redirects every rank's fd 1 is diverted away from the pipe. The asymmetry is intentional and documented in the code: rank 0 keeps a dup of fd 1 for the JSON channel and sends the rest to stderr (preserved in CI logs); rank>0 has nothing to preserve, so it goes to /dev/null.

Parent — test_attention_with_cp.py: comment-only update. The read path is unchanged (single select + readline + json.loads), since with both redirects in place the pipe carries only rank-0's JSON line. A non-JSON line is still turned into a clear AssertionError ("pool worker JSON protocol broke") rather than a raw JSONDecodeError.

Why fd-level isolation rather than a stdout sentinel/skip

An earlier revision prefixed responses with a [CP_POOL_RESP] sentinel and had the parent skip non-protocol lines. That was dropped after measuring it: with the rank-0 fd redirect in place, instrumenting the parent to count non-sentinel lines on the pipe under NCCL_DEBUG=INFO (max verbosity) across multiple cases yields zero — the torchrun agent logs to stderr, not this stdout pipe, so the sentinel filtered nothing. fd isolation also behaves better under NCCL_DEBUG=INFO: the (large) banner volume goes to stderr instead of forcing the parent to read+skip hundreds of lines within the response deadline. Net result is a smaller, single-mechanism fix.

Validation

8×H100, TE built from this commit, with the NCCL banner confirmed emitted in every run (grep "NCCL version"):

Scenario	Result
`NCCL_DEBUG=VERSION`, flash p2p + all_gather + a2a (one session, pool reused across the banner)	✅ 3 passed
`NCCL_DEBUG=INFO` (verbose), flash p2p + fused p2p	✅ 2 passed
`NCCL_DEBUG=VERSION`, fused p2p	✅ 1 passed
`NCCL_DEBUG` unset (control)	✅ 1 passed
L1 det/non-det parallel suite (`qa/L1_pytorch_distributed_unittest/test.sh` pattern, `NCCL_DEBUG=VERSION`)	✅ non-det 45 passed / det 33 passed, 0 protocol breaks
Without this fix, `NCCL_DEBUG=VERSION`	❌ `pool worker JSON protocol broke` (regression reproduced)

Type of change

Bug fix (test infrastructure; no library/runtime code change)

Checklist

Contributing guidelines followed
Changes commented where non-obvious
No new warnings
Existing test suite serves as input + validation
Existing tests pass locally (8×H100, NCCL_DEBUG=VERSION/INFO/unset; L1 det+non-det)

greptile-apps · 2026-06-02T00:12:40Z

Greptile Summary

This PR fixes a CI regression where NCCL_DEBUG=VERSION/INFO caused NCCL banner lines to corrupt the line-delimited JSON protocol channel between the CP pool worker (rank-0 stdout) and the parent pytest process. The fix is a two-layer defense applied before importing torch: an fd-level dup of the original stdout is reserved exclusively for JSON (_JSON_STDOUT), and fd 1 is redirected to stderr so all C-level and Python-level library writes during import or runtime land on stderr instead of the protocol pipe.

Worker (run_attention_with_cp_pool.py): Adds _redirect_rank0_stdout_to_stderr() called at module level before import torch; _send_response now writes through _JSON_STDOUT (the private dup) instead of sys.stdout.
Parent (test_attention_with_cp.py): Updates comments only — logic is unchanged since the earlier sentinel-based approach was already removed, leaving the parent to rely entirely on the fd-level isolation.

Confidence Score: 5/5

Safe to merge — test infrastructure only, no library or runtime code changed.

The fd-level isolation is correctly placed before import torch so even import-time C-level writes are captured. The _JSON_STDOUT dup is only used on rank 0, eliminating any null-dereference risk on other ranks. The parent-side changes are comment-only.

No files require special attention.

Important Files Changed

Filename	Overview
tests/pytorch/attention/run_attention_with_cp_pool.py	Adds fd-level stdout isolation for rank-0 before torch import; _send_response correctly routes through the preserved _JSON_STDOUT dup. Logic is sound and all rank/null guards are in place.
tests/pytorch/attention/test_attention_with_cp.py	Comment-only updates to _submit_once; no logic changes. Accurately reflects that isolation is now done at the fd level on the worker side.

Sequence Diagram

sequenceDiagram
    participant Parent as pytest (parent)
    participant Pipe as stdout pipe (fd 1 original)
    participant Stderr as stderr
    participant Worker as torchrun worker (rank-0)

    Note over Worker: Module load BEFORE import torch
    Worker->>Pipe: os.dup(1) _JSON_STDOUT (private fd)
    Worker->>Stderr: os.dup2(2, 1) fd 1 now points at stderr
    Worker->>Stderr: "sys.stdout = sys.stderr"

    Note over Worker: import torch (NCCL init, banners)
    Worker->>Stderr: NCCL banner via fd 1 to stderr

    Note over Worker: main() dispatch loop
    Parent->>Worker: JSON request via stdin
    Worker->>Pipe: _JSON_STDOUT.write(json.dumps(payload))
    Parent->>Pipe: select() + readline()
    Parent->>Parent: json.loads(line) clean JSON

_{Reviews (4): Last reviewed commit: "Merge branch 'main' into sudhakars/cp_po..." | Re-trigger Greptile}

greptile-apps · 2026-06-02T00:12:45Z

+_RESP_PREFIX = "[CP_POOL_RESP] "
+
+
 def _send_response(rank: int, payload: dict) -> None:


_RESP_PREFIX duplicated across worker and parent without a shared source

The sentinel string "[CP_POOL_RESP] " is defined independently in both run_attention_with_cp_pool.py and test_attention_with_cp.py. If they ever drift (e.g. a typo during a rename), the parent will silently spin through every line until it times out, emitting a confusing "timed out" failure rather than a protocol-mismatch error. A brief comment calling out the coupling would make this risk explicit.

Suggested change

_RESP_PREFIX = "[CP_POOL_RESP] "

def _send_response(rank: int, payload: dict) -> None:

# NOTE: this string must stay identical to PoolWorker._RESP_PREFIX in

# test_attention_with_cp.py; the parent uses it to identify response lines.

_RESP_PREFIX = "[CP_POOL_RESP] "

def _send_response(rank: int, payload: dict) -> None:

Resolved by removing the sentinel entirely in e543ef89 — the rank-0 fd redirect keeps the pipe clean on its own (verified: 0 non-protocol lines under NCCL_DEBUG=INFO), so there's no longer a duplicated _RESP_PREFIX to keep in sync across the two files.

sudhakarsingh27

Targeted fix, mostly minimal. Main asks: drop the unreachable _JSON_STDOUT is None raise in _send_response and the dead re-entry guard in _redirect_rank0_stdout_to_stderr. The fd-redirect + sentinel pair is justified (worker process vs torchrun agent process — different coverage), but confirm the agent actually writes to this pipe; if that's speculative the sentinel alone suffices. fd-level ops are correct and the import-time placement is right. Comments are on the verbose side but explain real non-obvious intent, so fine to keep.

The CP attention test pool worker (PR NVIDIA#2993) uses rank-0 stdout as a line-delimited JSON protocol channel to the parent pytest process. When NCCL_DEBUG is set to VERSION or INFO, NCCL writes an "NCCL version ...\n" banner to stdout (fd 1); that banner reaches the parent ahead of the first JSON response, json.loads raises, and because the pool fixture is session-scoped and killed on first failure, every subsequent CP test in the file fails. CI runners that export NCCL_DEBUG hit this on all ~200 non-skipped cases. The prior mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream. Fix it at the source: in the worker, before importing torch, dup rank-0's original fd 1 into a private stream reserved for JSON, then point fd 1 at stderr. Any banner from NCCL/cuBLAS/cuDNN/torch (Python or C level, import-time or runtime) now lands on stderr — still drained into CI logs by the parent's stderr thread — instead of the protocol pipe. Combined with the existing non-rank-0 /dev/null redirect, the pipe carries only rank-0 JSON, so the parent's single-line read needs no change. Validated on 8xH100 (TE built from this commit). With NCCL_DEBUG=VERSION and =INFO, flash (p2p/all_gather/a2a) and fused cases pass and zero non-protocol lines reach the pipe; without the fix the same cases fail with "pool worker JSON protocol broke". Control with NCCL_DEBUG unset also passes. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

sudhakarsingh27 · 2026-06-04T05:38:16Z

/te-ci pytorch L3

cyanguwa

Please make sure you test with NCCL_DEBUG turned on. Thanks for the PR!

sudhakarsingh27 self-assigned this Jun 2, 2026

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

sudhakarsingh27 commented Jun 2, 2026

View reviewed changes

sudhakarsingh27 force-pushed the sudhakars/cp_pool_stdout_isolation branch from 8332034 to e543ef8 Compare June 2, 2026 00:23

Merge branch 'main' into sudhakars/cp_pool_stdout_isolation

2a9133e

sudhakarsingh27 requested review from KshitijLakhani and cyanguwa June 4, 2026 15:27

Merge branch 'main' into sudhakars/cp_pool_stdout_isolation

ea95e99

cyanguwa approved these changes Jun 4, 2026

View reviewed changes

sudhakarsingh27 merged commit 0e58073 into NVIDIA:main Jun 4, 2026
12 of 13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074

[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074
sudhakarsingh27 merged 3 commits into
NVIDIA:mainfrom
sudhakarsingh27:sudhakars/cp_pool_stdout_isolation

sudhakarsingh27 commented Jun 2, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jun 2, 2026

Uh oh!

sudhakarsingh27 Jun 2, 2026

Uh oh!

sudhakarsingh27 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sudhakarsingh27 commented Jun 4, 2026

Uh oh!

cyanguwa left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		_RESP_PREFIX = "[CP_POOL_RESP] "


		def _send_response(rank: int, payload: dict) -> None:

Conversation

sudhakarsingh27 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Why fd-level isolation rather than a stdout sentinel/skip

Validation

Type of change

Checklist

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

sudhakarsingh27 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

sudhakarsingh27 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sudhakarsingh27 commented Jun 4, 2026

Uh oh!

cyanguwa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sudhakarsingh27 commented Jun 2, 2026 •

edited

Loading

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading