Enable Qwen3.5-MoE AOTI export on memory-constrained GPUs#19205

Open

Gasoonjia wants to merge 3 commits into main from cuda-aoti-low-mem-export

Conversation

@Gasoonjia
Contributor

Goal

Make the Qwen3.5-35B-A3B HQQ-INT4 CUDA AOTI export viable on consumer-class
24 GB GPUs (RTX 4090 / 3090), so users without datacenter hardware
can run the model end-to-end.

Why it's needed

Out of the box, the export OOMs on anything smaller than ~40 GB because the
AOTI compile pipeline (a) clones mutated buffers onto the target device on
top of the live model, and (b) leaks multiple GB of CUDA tensors between
method compiles via Inductor / Triton internal caches.

What this PR changes

All fixes are scoped to executorch/backends/cuda/cuda_backend.py; there are
no PyTorch core patches and no impact on Metal or other AOTI backends:

  1. Monkey-patch Inductor's buffer cloning so the AOTI compile-time clones
    land on CPU instead of the target GPU, and patch the C++ wrapper's
    device codegen so constants_info_ still points at the real model
    device for the runtime.
  2. Wire those patches into the existing
    get_extra_aoti_compile_context_manager() so they only apply during
    aot_compile and revert cleanly afterwards.
  3. Override preprocess_multimethod for CUDA only: between method
    compiles, walk live CUDA tensors and release their storage so stale
    Inductor / Triton cache leftovers don't accumulate across
    decode + prefill.
  4. Add a regression guard: the Qwen3.5 MoE export script prints the peak
    GPU memory used, and CI fails if it exceeds 20 GB.

Verified

  • Export succeeds on an A100 capped to 20 GB; peak ~18 GB.
  • Export succeeds on a real RTX 4090.
  • Inference perf matches an unconstrained export within noise.
  • Without the fixes, peak shoots to ~37 GB and the new CI gate fails.

Future upstream work

The fixes here are workarounds for three underlying Inductor / PyTorch
issues that deserve proper upstream solutions:

  1. A first-class option for aot_compile to clone lifted buffers on CPU
    while still recording the model's target device for the runtime.
  2. Inductor / Triton should drain their own caches at the end of each
    aot_compile call (or expose a public reset API).
  3. wrap_triton should accept a mutates_args parameter so Triton-kernel
    mutations don't have to be re-derived from TTIR (detailed below).

Once those land upstream, the entire workaround block in
cuda_backend.py can be removed.

## Summary

Enable end-to-end ExecuTorch export of large models (e.g. Qwen3.5-35B-A3B
HQQ-INT4, ~18 GB of weights) under tight GPU memory budgets such as the
24 GB cap of consumer cards (RTX 4090 / 3090 / etc.) using the CUDA AOTI
backend.

Out of the box, calling `torch._inductor.aot_compile` on this model on a
24 GB-capped GPU OOMs in two distinct places:

1. **`_unlift_graph` clones every mutated buffer onto the model's target
   device.** After `move_to_device_pass(...)` that target is CUDA, so we
   end up with a transient ~18 GB GPU clone of the model weights on top
   of the live model — instant OOM.
2. **Inductor / Triton internal caches keep multi-GB worth of CUDA
   tensors alive between method compilations.** When ExecuTorch lowers a
   multi-method export (e.g. decode + prefill) those leftovers stack up,
   so the second method's compile starts from a half-full GPU and OOMs
   again under the 24 GB cap.

This diff works around both issues in `CudaBackend` only; there are no
changes to PyTorch core and no impact on Metal / other AOTI backends. The
constrained setup itself can be reproduced as sketched below.
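
For reference, the memory-constrained environment used throughout this diff
can be reproduced before export with a single call (fraction per the Test
Plan below; 0.3 of an 80 GB A100 ≈ 24 GB):

```python
import torch

# Cap device 0 to 30% of its total VRAM. On an 80 GB A100 this leaves
# ~24 GB visible, so allocations beyond the cap raise OOM just as they
# would on a 24 GB consumer card.
torch.cuda.set_per_process_memory_fraction(0.3, 0)
```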

## What this diff does

`backends/cuda/cuda_backend.py`:

1. **`_compile_time_cpu_clones(target_device)`** — context manager that
   wraps `torch._inductor.compile_fx.clone_preserve_strides` so the
   buffer clones produced by `_unlift_graph` land on CPU instead of the
   target device. The wrap is **frame-discriminated**
   (`sys._getframe(1).f_code.co_name == "_unlift_graph"`) so it does
   *not* affect `triton_heuristics.py:1101`, which re-imports the same
   symbol for autotune benchmark inputs that legitimately must stay on
   GPU. It also wraps `CppWrapperCpu.codegen_device` so the generated
   `constants_info_[i].device_type` still points at the real model
   target device (e.g. cuda), preventing a mixed-device runtime error
   when the constants are loaded back at inference time. Both patches
   are sketched after this list.

2. **`get_extra_aoti_compile_context_manager()`** — chains the existing
   SDPA-MATH manager with `_compile_time_cpu_clones` via an
   `ExitStack`, so both fire around the `torch._inductor.aot_compile`
   call in `AotiBackend.preprocess`.

3. **`preprocess_multimethod()`** — overrides the base implementation
   with a CUDA-specific cleanup loop that runs after every method
   compile. It walks `gc.get_objects()`, finds every live CUDA tensor,
   and calls `untyped_storage().resize_(0)` on it. This is how we
   release ~18 GB of stale Inductor / Triton cache leftovers between
   `decode` and `prefill` compiles. We need the in-place
   `resize_(0)` (rather than `del + gc.collect()`) because the cache
   still holds Python references — only forcibly emptying the storage
   reclaims the GPU memory. Other AOTI backends (Metal/MPS) inherit the
   default no-cleanup base implementation and are unaffected. The cleanup
   loop is also sketched after this list.
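
Minimal sketches of the three pieces above. The `clone_preserve_strides`
symbol, the `_unlift_graph` frame check, and the storage-release trick come
from this diff; the manager name `_sdpa_math_manager`, the attribute
`self.device`, and the helper names are illustrative assumptions, and the
companion `CppWrapperCpu.codegen_device` patch is omitted for brevity.

```python
import contextlib
import sys
from unittest import mock

import torch
import torch._inductor.compile_fx as compile_fx


@contextlib.contextmanager
def _compile_time_cpu_clones(target_device: torch.device):
    """Stage _unlift_graph's buffer clones on CPU during aot_compile."""
    orig_clone = compile_fx.clone_preserve_strides

    def cpu_clone(x, *args, **kwargs):
        # Frame discrimination: only clones requested by _unlift_graph are
        # redirected; autotuning code paths that use the same symbol keep
        # their GPU clones.
        if sys._getframe(1).f_code.co_name == "_unlift_graph":
            return orig_clone(x.to("cpu"), *args, **kwargs)
        return orig_clone(x, *args, **kwargs)

    with mock.patch.object(compile_fx, "clone_preserve_strides", cpu_clone):
        yield  # reverts cleanly on exit, even if compilation raises
```

Chaining into the existing hook could then look like this:

```python
def get_extra_aoti_compile_context_manager(self):
    @contextlib.contextmanager
    def combined():
        with contextlib.ExitStack() as stack:
            stack.enter_context(self._sdpa_math_manager())  # pre-existing
            stack.enter_context(_compile_time_cpu_clones(self.device))
            yield

    return combined()
```

And the per-method cleanup behind `preprocess_multimethod`, with the
`[CLEANUP]` logging and error handling trimmed:

```python
import gc

import torch


def _release_stale_cuda_tensors() -> int:
    """Forcibly empty every live CUDA tensor's storage; returns bytes freed."""
    freed = 0
    for obj in gc.get_objects():
        try:
            is_cuda_tensor = isinstance(obj, torch.Tensor) and obj.is_cuda
        except Exception:  # gc can surface exotic objects; skip them
            continue
        if is_cuda_tensor:
            storage = obj.untyped_storage()
            freed += storage.nbytes()
            # resize_(0) frees the device memory even though the Inductor /
            # Triton caches still hold Python references to the tensor,
            # which is why del + gc.collect() would not reclaim anything.
            storage.resize_(0)
    torch.cuda.empty_cache()
    return freed
```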

## Verified behavior

- `python -m executorch.examples.models.qwen3_5_moe.export
  --prequantized <hqq-int4-bundle> --backend cuda` succeeds end-to-end
  with `torch.cuda.set_per_process_memory_fraction(0.3, 0)` on an
  80 GB A100 (= 24 GB visible) — peak GPU usage during compile stays at
  ~19 GB.
- Both `[CLEANUP]` lines fire and report ~18.29 GB freed per method.
- `qwen3_5_moe_runner` inference produces coherent text and matches
  the perf of an unconstrained-VRAM export within measurement noise
  (1903 tok/s prefill, 160 tok/s decode on A100 with `--cuda_graph=true`,
  571-token prompt + 128 generated, GPU peak 18 GB).

## What should eventually move upstream

The three workarounds here all paper over real PyTorch issues that
deserve a proper fix in core:

1. **`_unlift_graph` cloning on the target device.** Cloning lifted
   buffers onto whatever device they happen to live on is not free —
   for large models the clone alone OOMs. Inductor should either
   stage the clone on CPU explicitly or expose an option to do so.
   Today we have to monkey-patch `clone_preserve_strides` *and* the
   wrapper's device codegen to compensate; both could be replaced by a
   first-class API such as `aot_compile(..., clone_buffers_on="cpu")`
   plus an internal "original device of constant" record so the C++
   wrapper writes the right `constants_info_`.

2. **Inductor / Triton caches leak compile-time CUDA tensors.** After
   `aot_compile` returns, the `.so` and `.ptd` are written, the result
   bytes are in our hands, and every CUDA tensor still alive is by
   definition stale. Today we have to walk `gc.get_objects()` and
   manually release each storage. Inductor should drain its own caches
   (`PyCodeCache`, `CompiledFxGraph`, `CachingAutotuner`, …) at the end
   of an `aot_compile` call, or at least expose a
   `torch._inductor.reset_compile_caches()` helper (a possible shape is
   sketched after this list).

3. **`wrap_triton` has no `mutates_args` parameter.** This is the
   underlying reason `identify_mutated_tensors` exists at all: the
   inner HOP for a Triton kernel call has to re-derive mutations from
   TTIR because the user can't declare them. A future
   `wrap_triton(kernel, mutates_args={"C"})` would let kernel authors
   short-circuit the TTIR analysis (and avoid the historical fallback
   that marks every input as mutated, which still shows up in older
   PyTorch / Triton combinations).
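
A possible shape for the item-2 helper, sketched as a hypothetical API
(`reset_compile_caches` does not exist today; `PyCodeCache.cache_clear()`
and `torch._dynamo.reset()` do):

```python
import torch
import torch._dynamo
import torch._inductor.codecache


def reset_compile_caches() -> None:
    # Hypothetical: what torch._inductor.reset_compile_caches() might do.
    # PyCodeCache.cache_clear() is a real Inductor entry point; the
    # CompiledFxGraph / CachingAutotuner state named above would need
    # their own public reset hooks.
    torch._inductor.codecache.PyCodeCache.cache_clear()
    torch._dynamo.reset()  # also drops cached compiled-graph state
    torch.cuda.empty_cache()  # return freed blocks to the driver
```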

Once those land, the entire monkey-patch block in this file can be
deleted.

## Test Plan

- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
  --prequantized ... --backend cuda` with
  `torch.cuda.set_per_process_memory_fraction(0.3, 0)` (24 GB cap on an
  80 GB A100). Export succeeded; both `[CLEANUP]` lines fired; peak GPU
  usage stayed under 24 GB.
- Manual: ran `qwen3_5_moe_runner` against the exported `.pte` /
  `.ptd`. Inference produced coherent output, prefill 1903 tok/s,
  decode 160 tok/s with `--cuda_graph=true`, GPU peak 18 GB.
- Unaffected backends: Metal / other AOTI backends inherit the default
  `BackendDetails.preprocess_multimethod` (no cleanup) and are not
  touched by this diff.

## Summary

Add a GPU memory regression guard so that the Qwen3.5 MoE export keeps
fitting on consumer-grade 24 GB GPUs (RTX 4090 / 3090 / A5000 …).

## What this diff does

1. `examples/models/qwen3_5_moe/export.py`
   - Reset CUDA peak memory stats at the start of the CUDA backend setup.
   - At the end of `main()`, when running with `--backend cuda`, print a
     stable, machine-parseable marker line:
       `EXPORT_GPU_PEAK_MEMORY_MB: <peak_in_MB>`
     This makes the actual peak GPU memory consumed by the entire
     load + quantize + lower pipeline visible to both humans and CI.

2. `.ci/scripts/export_model_artifact.sh` (qwen3_5_moe path)
   - Tee the export output to a temp log.
   - Grep the `EXPORT_GPU_PEAK_MEMORY_MB` marker and compare against
     `EXPORT_GPU_PEAK_MB_LIMIT` (default 20480 MB = 20 GB; overridable
     via env var).
   - Fail the job with an explanatory error if the budget is exceeded,
      so any future regression that reintroduces the ~18 GB unnecessary
      GPU clone (or a comparable leak) is caught at PR time rather than
      silently breaking 24 GB-class GPUs. Both the marker emission and
      the gate logic are sketched after this list.
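
A sketch of the marker emission in `export.py`; the exact placement and
the choice of `max_memory_allocated` as the peak statistic are assumptions:

```python
import torch

torch.cuda.reset_peak_memory_stats()  # at the start of CUDA backend setup

# ... load + quantize + lower pipeline runs here ...

peak_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
print(f"EXPORT_GPU_PEAK_MEMORY_MB: {peak_mb:.0f}")
```

The gate itself lives in the shell script; its logic, rendered here in
Python for illustration (assuming an integer-MB marker value):

```python
import os
import re
import sys

LIMIT_MB = int(os.environ.get("EXPORT_GPU_PEAK_MB_LIMIT", "20480"))


def check_export_log(log_text: str) -> None:
    m = re.search(r"EXPORT_GPU_PEAK_MEMORY_MB:\s*(\d+)", log_text)
    if m is None:
        sys.exit("export log is missing the EXPORT_GPU_PEAK_MEMORY_MB marker")
    peak_mb = int(m.group(1))
    if peak_mb > LIMIT_MB:
        sys.exit(f"GPU peak {peak_mb} MB exceeds the {LIMIT_MB} MB budget")
```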

## Notes

- Current measured peak with the CUDA backend memory fixes (see prior
  commit on this branch) is ~18 GB, leaving ~2 GB headroom under the
  20 GB limit. Without those fixes the peak shoots to ~37 GB and CI
  will fail loudly.
- The threshold is intentionally tighter than the 24 GB physical cap
  to leave room for measurement noise and small allocator overhead.

## Test Plan

- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
  --prequantized <hqq-int4-bundle> --backend cuda` and confirmed the
  marker line is printed at the end with a sensible value (~18 GB).
- Manual: simulated CI gate logic locally with the marker line and
  confirmed both the success path and the failure path (forced
  threshold below the actual peak) behave as expected.
Gasoonjia requested a review from lucylq as a code owner on April 29, 2026 at 08:04

pytorch-bot Bot commented Apr 29, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19205

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 18 New Failures, 5 Cancelled Jobs, 8 Unrelated Failures

As of commit ef8aeb2 with merge base 8bda9ac:


👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label on Apr 29, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Gasoonjia requested review from digantdesai and mergennachin and removed the request for lucylq on April 29, 2026 at 20:32

Labels

ciflow/cuda, ciflow/metal, ciflow/mlx, CLA Signed
