Pr 44 device map fix by ctrl-gaurav · Pull Request #81 · ctrl-gaurav/effGen

ctrl-gaurav · 2026-06-09T00:53:25Z

Summary

Make the Transformers fallback engine survive broken device_map="auto" sharding. On some multi-GPU systems, device_map="auto" shards a model across GPUs in a way that produces invalid logits, which then triggers a CUDA device-side assert during sampling and poisons the GPU context for the rest of the process. This change detects that situation and automatically reloads the model pinned to GPU 0 (device_map={"": 0}) before sampling can corrupt CUDA, with a fallback retry path if an assert still slips through.
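The probe-then-pin decision described above can be illustrated with a small, framework-free sketch. It assumes "invalid logits" means NaN or infinite values, which the PR description does not spell out. The names `logits_look_valid` and `choose_device_map` are hypothetical and are not the helpers in transformers_engine.py. In the real engine the check would run on a tensor from a short forward pass rather than on a Python list.

```python
import math
from typing import Dict, Iterable, Union

DeviceMap = Union[str, Dict[str, int]]


def logits_look_valid(logits: Iterable[float]) -> bool:
    """Return True if there is at least one logit and all are finite.

    Assumption: "invalid" means NaN or +/-inf. The actual probe may
    apply additional or different criteria.
    """
    values = list(logits)
    if not values:
        return False
    return all(math.isfinite(v) for v in values)


def choose_device_map(requested: DeviceMap, probe_logits: Iterable[float]) -> DeviceMap:
    """Hypothetical helper: fall back to GPU 0 only when "auto" fails the probe."""
    if requested == "auto" and not logits_look_valid(probe_logits):
        # {"": 0} places the whole model on GPU 0 instead of sharding it.
        return {"": 0}
    return requested
```

An explicit device map passed by the caller is left untouched in this sketch; only the "auto" case is overridden.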

Merged from PR #44 (Aafiya-H) into pr-44-device-map-fix with the transformers_engine.py conflict resolved — quiet model loading, debug-level logging, and BPE-safe decoding from main were preserved alongside the new device-map recovery logic.

Changes

effgen/models/transformers_engine.py
- Refactor model loading into helpers: _assemble_model_kwargs, _from_pretrained_with_flash_fallback, _load_model_weights, _drop_model_weights (quiet loading + Flash-Attention fallback preserved).
- Add device_map="auto" recovery: probe logits via a short forward pass (_probe_auto_device_map_logits) and, if they look invalid, reload pinned to GPU 0 before sampling (_ensure_device_map_viable_before_sampling, _apply_cuda_device_map_pin_fallback, _effective_device_map).
- Detect CUDA device-side asserts (_is_cuda_device_side_assert) and retry generation once after pinning to GPU 0 (_maybe_retry_after_cuda_assert); covers generate, generate_batch, and generate_stream. Raises a clear "restart the process / set CUDA_VISIBLE_DEVICES=0" error when the context is already poisoned.
- Reset _pin_device_map_for_cuda / _retrying_after_cuda_assert state on unload(), and guard torch.cuda.empty_cache().
examples/basic/calculator_agent.py
- Tear down the math agent in a finally block (agent.close() before model.unload()) so demo/interactive runs exit cleanly without the "Agent was garbage-collected without calling close()" warning, even when generation fails mid-run.
CHANGELOG.md
- Documented the calculator-agent cleanup fix.
test.py
- ⚠️ Root-level scratch/demo script (loads Qwen2.5-3B and runs one calculation). Recommend removing before merge: it is not part of the test suite and lives outside tests/.
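The retry-once behaviour listed for transformers_engine.py can be sketched as follows. This is a minimal illustration under assumptions, not the engine's code: `RetryOnceRunner`, `pin_to_gpu0`, and `RESTART_HINT` are hypothetical names, and the assert is detected by matching the "device-side assert" substring that PyTorch includes in its "CUDA error: device-side assert triggered" message.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical wording; the PR only says the error mentions these two remedies.
RESTART_HINT = (
    "CUDA context is poisoned by a device-side assert; "
    "restart the process or set CUDA_VISIBLE_DEVICES=0"
)


def is_cuda_device_side_assert(exc: BaseException) -> bool:
    """Heuristic: match the substring PyTorch puts in the error message."""
    return "device-side assert" in str(exc).lower()


class RetryOnceRunner:
    """Run a generation callable, retrying at most once after pinning to GPU 0."""

    def __init__(self, pin_to_gpu0: Callable[[], None]) -> None:
        self._pin_to_gpu0 = pin_to_gpu0  # assumed to reload the model on GPU 0
        self._retrying = False

    def run(self, generate: Callable[[], T]) -> T:
        try:
            return generate()
        except RuntimeError as exc:
            # Unrelated errors, or an assert during a retry, propagate unchanged.
            if self._retrying or not is_cuda_device_side_assert(exc):
                raise
            self._retrying = True
            try:
                self._pin_to_gpu0()
                return generate()
            except RuntimeError as retry_exc:
                if is_cuda_device_side_assert(retry_exc):
                    # A second assert means the context cannot be recovered here.
                    raise RuntimeError(RESTART_HINT) from retry_exc
                raise
            finally:
                self._retrying = False
```

Resetting the retry flag in `finally` mirrors the PR's note that the retry state is cleared, so a later call starts from a clean state.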

Testing

Unit tests pass: pytest tests/unit/ -v --no-cov
Integration tests pass (if applicable)
CHANGELOG.md updated
Manual: multi-GPU run where device_map="auto" previously asserted now falls back to GPU 0 and generates successfully

Type of Change

Bug fix
New feature
Breaking change
Documentation update

…breaks sampling, fix(examples): close math agent before unload

device_map=auto on multi-GPU nodes produced invalid logits and CUDA multinomial asserts. Probe logits after load and pin to {"": 0} before sampling; retry once on assert with a restart hint if the context is poisoned. Refactor model reload paths and harden calculator_agent cleanup in finally.

Aafiya-H and others added 4 commits May 21, 2026 17:51

merge branch 'main' of https://github.com/Aafiya-H/effGen

01caaa1

Merge branch 'observability' into fixing_side_assert

20c9c3a

Merge PR #44 (device-map fix), resolve transformers_engine conflict

29feeb7

ctrl-gaurav wants to merge 4 commits into main from pr-44-device-map-fix
