Skip to content

[ci] run sampler_test in-process (drop from SUBPROC_TEST_PATTERN)#555

Open
tiankongdeguiji wants to merge 1 commit into
alibaba:masterfrom
tiankongdeguiji:subproc-test-pattern-cleanup
Open

[ci] run sampler_test in-process (drop from SUBPROC_TEST_PATTERN)#555
tiankongdeguiji wants to merge 1 commit into
alibaba:masterfrom
tiankongdeguiji:subproc-test-pattern-cleanup

Conversation

@tiankongdeguiji

Copy link
Copy Markdown
Collaborator

What

Move sampler_test to the in-process test group (drop .sampler_test. from SUBPROC_TEST_PATTERN). All other entries stay.

Why

SUBPROC_TEST_PATTERN runs each matching test in a fresh python -m subprocess. For the graphlearn-sampler tests this was added (#76) to dodge graphlearn's module-global SERVER_LAUNCHED guard, which raises duplicate server launch detected when launch_server runs twice in one process.

Since #554 tzrec owns the server launch (inlined, without touching graphlearn's SERVER_LAUNCHED), so that guard no longer fires. But removal has to be decided per entry — what matters is where the server is launched, since that's where the new liveness watchdog's os._exit runs:

entry isolation reason still needed?
sampler_test dup-launch guard — but launches the server only in forked _sampler_worker children No → in-process
dataset_test launches the server in the main process → watchdog could os._exit the whole runner Yes
odps_dataset_test main-process launch + distributed RANK/WORLD_SIZE + init_process_group; skipped w/o ODPS creds Yes
parquet_dataset_test distributed RANK/WORLD_SIZE/MASTER_PORT + forked ranks Yes
tdm.gen_tree mutates USE_HASH_NODE_ID + forks mp.Process clustering workers Yes
convert_easyrec_config_to_tzrec_config_test needs PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python set before protobuf import (_run_env) Yes

Only sampler_test is safe to de-isolate: its watchdog runs in the worker child (can't kill the runner) and its USE_HASH_NODE_ID mutation is reset in setUp.

Verification (local, GPU box with graphlearn)

  • _gather_test_cases now puts sampler_test's 22 cases in the in-process group and leaves dataset/odps/parquet/gen_tree/convert in the subprocess group.
  • sampler_test's 22 tests pass in-process — including the watchdog test, whose os._exit(1) kills only its worker child (runner survives), and a hash+int run together.
  • A/B confirming the removed root cause: graphlearn's launch_server twice in one process raises duplicate server launch detected; tzrec's inlined launch does not.
  • Even with the watchdog poll forced to 0.3 s, the main-process dataset_test runner survives the recurring graphlearn server-child crash (the sampler is GC'd → stop.set() before the crash) — confirming the watchdog hazard is real only for main-process launchers and is confined by their subprocess isolation.

🤖 Generated with Claude Code

https://claude.ai/code/session_012uPyFcxcLAdNpgFFCYSZ2f

sampler_test was isolated only for graphlearn's "duplicate server launch
detected" guard. That guard no longer applies now that tzrec owns the server
launch (alibaba#554 inlines it without graphlearn's SERVER_LAUNCHED global), and
sampler_test launches the server only inside forked _sampler_worker children,
so the liveness watchdog runs in the child and cannot os._exit the test
runner. Verified: the module's 22 tests pass in-process.

The other entries stay isolated for reasons independent of this change:
dataset_test/odps_dataset_test launch the server in the MAIN process (the
watchdog would os._exit the whole runner), odps/parquet also set distributed
RANK/WORLD_SIZE + init_process_group and fork ranks, tdm.gen_tree mutates
USE_HASH_NODE_ID and forks clustering workers, and convert_easyrec_config needs
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION set before protobuf import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012uPyFcxcLAdNpgFFCYSZ2f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant