[ci] run sampler_test in-process (drop from SUBPROC_TEST_PATTERN) #555
Open
tiankongdeguiji wants to merge 1 commit into
sampler_test was isolated only for graphlearn's "duplicate server launch detected" guard. That guard no longer applies now that tzrec owns the server launch (alibaba#554 inlines it without graphlearn's SERVER_LAUNCHED global), and sampler_test launches the server only inside forked _sampler_worker children, so the liveness watchdog runs in the child and cannot os._exit the test runner. Verified: the module's 22 tests pass in-process.

The other entries stay isolated for reasons independent of this change: dataset_test/odps_dataset_test launch the server in the MAIN process (the watchdog would os._exit the whole runner), odps/parquet also set distributed RANK/WORLD_SIZE + init_process_group and fork ranks, tdm.gen_tree mutates USE_HASH_NODE_ID and forks clustering workers, and convert_easyrec_config needs PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION set before protobuf import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_012uPyFcxcLAdNpgFFCYSZ2f
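The claim that a watchdog's os._exit in a forked child cannot take down the test runner can be illustrated with a minimal sketch. This is not tzrec's or graphlearn's code: `_watchdog`, `_worker`, and `run_worker_and_survive` are hypothetical names, the watchdog is reduced to "exit unless stopped within an interval", and the `fork` start method is assumed (POSIX only).

```python
import multiprocessing as mp
import os
import threading
import time


def _watchdog(stop: threading.Event, interval: float = 0.05) -> None:
    """Hypothetical watchdog: hard-exit the current process unless stopped in time."""
    # Event.wait returns False on timeout, i.e. nobody set `stop` within `interval`.
    if not stop.wait(interval):
        os._exit(1)  # terminates only the process this thread is running in


def _worker() -> None:
    """Stand-in for a worker child that owns the server and its watchdog."""
    stop = threading.Event()
    threading.Thread(target=_watchdog, args=(stop,), daemon=True).start()
    time.sleep(5)  # simulate a hang; the watchdog fires well before this returns


def run_worker_and_survive() -> int:
    """Fork a worker, let its watchdog kill it, and return the child's exit code."""
    ctx = mp.get_context("fork")  # assumption: a POSIX platform with fork available
    child = ctx.Process(target=_worker)
    child.start()
    child.join(timeout=10)
    # Reaching this line means the parent (the "runner") is still alive.
    return child.exitcode


if __name__ == "__main__":
    print("child exit code:", run_worker_and_survive())
```

If the same watchdog thread were started in the main process instead, the `os._exit(1)` call would end the runner itself, which is the hazard described for the main-process launchers.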
What
Move `sampler_test` to the in-process test group (drop `.sampler_test.` from `SUBPROC_TEST_PATTERN`). All other entries stay.

Why
`SUBPROC_TEST_PATTERN` runs each matching test in a fresh `python -m` subprocess. For the graphlearn-sampler tests this was added (#76) to dodge graphlearn's module-global `SERVER_LAUNCHED` guard, which raises `duplicate server launch detected` when `launch_server` runs twice in one process.

Since #554, tzrec owns the server launch (inlined, without touching graphlearn's `SERVER_LAUNCHED`), so that guard no longer fires. But removal has to be decided per entry; what matters is where the server is launched, since that is where the new liveness watchdog's `os._exit` runs:

| Entry | Server launch site / reason for isolation |
| --- | --- |
| `sampler_test` | server launched only in forked `_sampler_worker` children |
| `dataset_test` | server launched in the main process; the watchdog would `os._exit` the whole runner |
| `odps_dataset_test` | server launched in the main process; also sets `RANK`/`WORLD_SIZE` + `init_process_group`; skipped w/o ODPS creds |
| `parquet_dataset_test` | sets `RANK`/`WORLD_SIZE`/`MASTER_PORT` + forked ranks |
| `tdm.gen_tree` | mutates `USE_HASH_NODE_ID` + forks `mp.Process` clustering workers |
| `convert_easyrec_config_to_tzrec_config_test` | `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python` set before protobuf import (`_run_env`) |

Only
`sampler_test` is safe to de-isolate: its watchdog runs in the worker child (can't kill the runner) and its `USE_HASH_NODE_ID` mutation is reset in `setUp`.

Verification (local, GPU box with graphlearn)
- `_gather_test_cases` now puts `sampler_test`'s 22 cases in the in-process group and leaves `dataset`/`odps`/`parquet`/`gen_tree`/`convert` in the subprocess group.
- `sampler_test`'s 22 tests pass in-process, including the watchdog test, whose `os._exit(1)` kills only its worker child (runner survives), and a hash+int run together.
- `launch_server` twice in one process raises `duplicate server launch detected`; tzrec's inlined launch does not.
- The `dataset_test` runner survives the recurring graphlearn server-child crash (the sampler is GC'd → `stop.set()` before the crash), confirming the watchdog hazard is real only for main-process launchers and is confined by their subprocess isolation.

🤖 Generated with Claude Code
https://claude.ai/code/session_012uPyFcxcLAdNpgFFCYSZ2f
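
For readers unfamiliar with the pattern-based split, the grouping that `_gather_test_cases` is described as performing can be sketched roughly as below. This is an illustration only, not the repository's implementation: `EXAMPLE_SUBPROC_PATTERN`, `split_test_ids`, and the test ids are hypothetical, and the regex merely mirrors the entries named in this PR after `.sampler_test.` is dropped.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical pattern, NOT copied from the repository. It lists the entries
# this PR says remain isolated; `.sampler_test.` is intentionally absent.
EXAMPLE_SUBPROC_PATTERN = re.compile(
    "|".join(
        [
            r"\.dataset_test\.",
            r"\.odps_dataset_test\.",
            r"\.parquet_dataset_test\.",
            r"\.tdm\.gen_tree\.",
            r"\.convert_easyrec_config_to_tzrec_config_test\.",
        ]
    )
)


def split_test_ids(
    test_ids: Iterable[str], pattern: "re.Pattern[str]" = EXAMPLE_SUBPROC_PATTERN
) -> Tuple[List[str], List[str]]:
    """Partition dotted test ids into (in-process, subprocess) groups."""
    in_process: List[str] = []
    isolated: List[str] = []
    for test_id in test_ids:
        # A test is isolated when any pattern entry occurs anywhere in its id.
        (isolated if pattern.search(test_id) else in_process).append(test_id)
    return in_process, isolated


# Hypothetical test ids, used only to exercise the split.
EXAMPLE_TEST_IDS = [
    "tzrec.datasets.sampler_test.SamplerTest.test_example",
    "tzrec.datasets.dataset_test.DatasetTest.test_example",
    "tzrec.tools.tdm.gen_tree.tree_test.TreeTest.test_example",
    "tzrec.utils.config_util_test.ConfigUtilTest.test_example",
]

if __name__ == "__main__":
    in_process, isolated = split_test_ids(EXAMPLE_TEST_IDS)
    print("in-process:", in_process)
    print("subprocess:", isolated)
```

With `.sampler_test.` absent from the pattern, the sampler test id falls into the in-process group while the `dataset_test` and `tdm.gen_tree` ids stay in the subprocess group.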