Enable Qwen3.5-MoE AOTI export on memory-constrained GPUs #19205
## Summary
Enable end-to-end ExecuTorch export of large models (e.g. Qwen3.5-35B-A3B
HQQ-INT4, ~18 GB of weights) under tight GPU memory budgets such as the
24 GB cap of consumer cards (RTX 4090 / 3090 / etc.) using the CUDA AOTI
backend.
Out of the box, calling `torch._inductor.aot_compile` on this model on a
24 GB-capped GPU OOMs in two distinct places:
1. **`_unlift_graph` clones every mutated buffer onto the model's target
device.** After `move_to_device_pass(...)` that target is CUDA, so we
end up with a transient ~18 GB GPU clone of the model weights on top
of the live model — instant OOM.
2. **Inductor / Triton internal caches keep multi-GB worth of CUDA
tensors alive between method compilations.** When ExecuTorch lowers a
multi-method export (e.g. decode + prefill) those leftovers stack up,
so the second method's compile starts from a half-full GPU and OOMs
again under the 24 GB cap.
This diff works around both issues in `CudaBackend` only — no changes to
PyTorch core, no impact on Metal / other AOTI backends.
## What this diff does
`backends/cuda/cuda_backend.py`:
1. **`_compile_time_cpu_clones(target_device)`** — context manager that
wraps `torch._inductor.compile_fx.clone_preserve_strides` so the
buffer clones produced by `_unlift_graph` land on CPU instead of the
target device. The wrap is **frame-discriminated**
(`sys._getframe(1).f_code.co_name == "_unlift_graph"`) so it does
*not* affect `triton_heuristics.py:1101`, which re-imports the same
symbol for autotune benchmark inputs that legitimately must stay on
GPU. It also wraps `CppWrapperCpu.codegen_device` so the generated
`constants_info_[i].device_type` still points at the real model
target device (e.g. cuda), preventing a mixed-device runtime error
when the constants are loaded back at inference time. (This wrapper is
sketched after the list.)
2. **`get_extra_aoti_compile_context_manager()`** — chains the existing
SDPA-MATH manager with `_compile_time_cpu_clones` via an
`ExitStack`, so both fire around the `torch._inductor.aot_compile`
call in `AotiBackend.preprocess`.
3. **`preprocess_multimethod()`** — overrides the base implementation
with a CUDA-specific cleanup loop that runs after every method
compile. It walks `gc.get_objects()`, finds every live CUDA tensor,
and calls `untyped_storage().resize_(0)` on it. This is how we
release ~18 GB of stale Inductor / Triton cache leftovers between
`decode` and `prefill` compiles. We need the in-place
`resize_(0)` (rather than `del + gc.collect()`) because the cache
still holds Python references — only forcibly emptying the storage
reclaims the GPU memory. Other AOTI backends (Metal/MPS) inherit the
default no-cleanup base implementation and are unaffected. (The cleanup
loop is also sketched after the list.)
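
For concreteness, here is a minimal sketch of the frame-discriminated
patch described in item 1. It assumes `clone_preserve_strides` takes a
single tensor argument; the actual diff also wraps
`CppWrapperCpu.codegen_device`, which is omitted here:

```python
import contextlib
import sys

import torch._inductor.compile_fx as compile_fx


@contextlib.contextmanager
def _compile_time_cpu_clones_sketch():
    # Illustrative reconstruction, not the actual diff.
    original = compile_fx.clone_preserve_strides

    def patched(x):
        # Frame discrimination: only redirect clones requested by
        # _unlift_graph. Other callers of the same symbol (e.g. autotune
        # benchmark-input cloning) keep the original device behavior.
        if sys._getframe(1).f_code.co_name == "_unlift_graph":
            # Move to CPU first, then clone, so compile never holds a
            # second full copy of the weights on the GPU.
            return original(x.to("cpu"))
        return original(x)

    compile_fx.clone_preserve_strides = patched
    try:
        yield
    finally:
        compile_fx.clone_preserve_strides = original
```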
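And a sketch of item 3's cleanup loop. The function name and the
freed-bytes accounting are illustrative, and it leans on the invariant
argued above: once a method's compile has finished, every CUDA tensor
still alive is a stale compile-time leftover.

```python
import gc

import torch


def _release_stale_cuda_tensors() -> None:
    freed = 0
    for obj in gc.get_objects():
        try:
            if not (torch.is_tensor(obj) and obj.is_cuda):
                continue
            storage = obj.untyped_storage()
            freed += storage.nbytes()  # shared storages may double-count
            # In-place resize is the key trick: the caches still hold
            # Python references, so `del` + `gc.collect()` cannot free
            # the memory, but emptying the storage itself does.
            storage.resize_(0)
        except Exception:
            continue  # some GC-tracked objects raise on attribute access
    torch.cuda.empty_cache()
    print(f"[CLEANUP] released ~{freed / 2**30:.2f} GB of CUDA storage")
```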
## Verified behavior
- `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized <hqq-int4-bundle> --backend cuda` succeeds end-to-end
with `torch.cuda.set_per_process_memory_fraction(0.3, 0)` on an
80 GB A100 (= 24 GB visible) — peak GPU usage during compile stays at
~19 GB (the cap setup is reproduced in the snippet below).
- Both `[CLEANUP]` lines fire and report ~18.29 GB freed per method.
- `qwen3_5_moe_runner` inference produces coherent text and matches
the perf of an unconstrained-VRAM export within measurement noise
(1903 tok/s prefill, 160 tok/s decode on A100 with `--cuda_graph=true`,
571-token prompt + 128 generated, GPU peak 18 GB).
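
The cap used above amounts to a single call before the export runs; a
minimal reproduction, assuming device 0 is the 80 GB A100:

```python
import torch

# 0.3 of an 80 GB A100 leaves ~24 GB visible to this process,
# matching the budget of a consumer RTX 4090 / 3090.
torch.cuda.set_per_process_memory_fraction(0.3, 0)
```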
## What should eventually move upstream
The three workarounds here all paper over real PyTorch issues that
deserve a proper fix in core:
1. **`_unlift_graph` cloning on the target device.** Cloning lifted
buffers onto whatever device they happen to live on is not free —
for large models the clone alone OOMs. Inductor should either
stage the clone on CPU explicitly or expose an option to do so.
Today we have to monkey-patch `clone_preserve_strides` *and* the
wrapper's device codegen to compensate; both could be replaced by a
first-class API such as `aot_compile(..., clone_buffers_on="cpu")`
plus an internal "original device of constant" record so the C++
wrapper writes the right `constants_info_`.
2. **Inductor / Triton caches leak compile-time CUDA tensors.** After
`aot_compile` returns the `.so` and `.ptd` are written, the result
bytes are in our hands, and every CUDA tensor still alive is by
definition stale. Today we have to walk `gc.get_objects()` and
manually release each storage. Inductor should drain its own caches
(`PyCodeCache`, `CompiledFxGraph`, `CachingAutotuner`, …) at the end
of an `aot_compile` call, or at least expose a
`torch._inductor.reset_compile_caches()` helper.
3. **`wrap_triton` has no `mutates_args` parameter.** This is the
underlying reason `identify_mutated_tensors` exists at all: the
inner HOP for a Triton kernel call has to re-derive mutations from
TTIR because the user can't declare them. A future
`wrap_triton(kernel, mutates_args={"C"})` would let kernel authors
short-circuit the TTIR analysis (and avoid the historical fallback
that marks every input as mutated, which still shows up in older
PyTorch / Triton combinations).
Once those land, the entire monkey-patch block in this file can be
deleted.
## Test Plan
- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized ... --backend cuda` with
`torch.cuda.set_per_process_memory_fraction(0.3, 0)` (24 GB cap on an
80 GB A100). Export succeeded; both `[CLEANUP]` lines fired; peak GPU
usage stayed under 24 GB.
- Manual: ran `qwen3_5_moe_runner` against the exported `.pte` /
`.ptd`. Inference produced coherent output, prefill 1903 tok/s,
decode 160 tok/s with `--cuda_graph=true`, GPU peak 18 GB.
- Unaffected backends: Metal / other AOTI backends inherit the default
`BackendDetails.preprocess_multimethod` (no cleanup) and are not
touched by this diff.
## Summary
Add a GPU memory regression guard so that the Qwen3.5 MoE export keeps
fitting on consumer-grade 24 GB GPUs (RTX 4090 / 3090 / A5000 …).
## What this diff does
1. `examples/models/qwen3_5_moe/export.py`
- Reset CUDA peak memory stats at the start of the CUDA backend setup.
- At the end of `main()`, when running with `--backend cuda`, print a
stable, machine-parseable marker line:
`EXPORT_GPU_PEAK_MEMORY_MB: <peak_in_MB>`
This makes the actual peak GPU memory consumed by the entire
load + quantize + lower pipeline visible to both humans and CI (the
marker emission is sketched after this list).
2. `.ci/scripts/export_model_artifact.sh` (qwen3_5_moe path)
- Tee the export output to a temp log.
- Grep the `EXPORT_GPU_PEAK_MEMORY_MB` marker and compare against
`EXPORT_GPU_PEAK_MB_LIMIT` (default 20480 MB = 20 GB; overridable
via env var).
- Fail the job with an explanatory error if the budget is exceeded,
so any future regression that reintroduces the unnecessary ~18 GB
GPU clone (or comparable leak) is caught at PR time rather than
silently breaking 24 GB-class GPUs (a local simulation of the gate is
also sketched after this list).
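
A minimal sketch of the marker emission from item 1; the helper names
here are illustrative, not the actual functions in `export.py`:

```python
import torch


def reset_peak_tracking(device: int = 0) -> None:
    # Called at the start of CUDA backend setup so the recorded peak
    # reflects only the load + quantize + lower pipeline.
    torch.cuda.reset_peak_memory_stats(device)


def print_peak_marker(device: int = 0) -> None:
    # Called at the end of main() when --backend cuda is in effect.
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    # Stable, machine-parseable line that the CI gate greps for.
    print(f"EXPORT_GPU_PEAK_MEMORY_MB: {peak_mb}")
```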
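The gate from item 2 itself lives in bash; a Python sketch of the
equivalent logic, which is also handy for simulating the gate locally as
the test plan does:

```python
import os
import re
import sys


def check_gpu_peak(log_text: str) -> None:
    # Default budget of 20480 MB (20 GB), overridable via env var.
    limit_mb = int(os.environ.get("EXPORT_GPU_PEAK_MB_LIMIT", "20480"))
    match = re.search(
        r"^EXPORT_GPU_PEAK_MEMORY_MB:\s*(\d+)", log_text, re.MULTILINE
    )
    if match is None:
        sys.exit("export log is missing the EXPORT_GPU_PEAK_MEMORY_MB marker")
    peak_mb = int(match.group(1))
    if peak_mb > limit_mb:
        sys.exit(
            f"GPU peak {peak_mb} MB exceeds the {limit_mb} MB budget; "
            "this export would no longer fit 24 GB-class GPUs"
        )
```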
## Notes
- Current measured peak with the CUDA backend memory fixes (see prior
commit on this branch) is ~18 GB, leaving ~2 GB headroom under the
20 GB limit. Without those fixes the peak shoots to ~37 GB and CI
will fail loudly.
- The threshold is intentionally tighter than the 24 GB physical cap
to leave room for measurement noise and small allocator overhead.
## Test Plan
- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
--prequantized <hqq-int4-bundle> --backend cuda` and confirmed the
marker line is printed at the end with a sensible value (~18 GB).
- Manual: simulated CI gate logic locally with the marker line and
confirmed both the success path and the failure path (forced
threshold below the actual peak) behave as expected.
## Goal
Make the Qwen3.5-35B-A3B HQQ-INT4 CUDA AOTI export viable on
consumer-class 24 GB GPUs (RTX 4090 / 3090), so users without datacenter
hardware can run the model end-to-end.
## Why it's needed
Out of the box the export OOMs on anything smaller than ~40 GB because the
AOTI compile pipeline (a) clones mutated buffers onto the target device on
top of the live model, and (b) leaks multiple GB of CUDA tensors between
method compiles via Inductor / Triton internal caches.
## What this PR changes
All fixes are scoped to `executorch/backends/cuda/cuda_backend.py` — no
PyTorch core patches, no impact on Metal or other AOTI backends:
- Patch `clone_preserve_strides` so the buffer clones made by
  `_unlift_graph` land on CPU instead of the target GPU, and patch the
  C++ wrapper's device codegen so `constants_info_` still points at the
  real model device for the runtime.
- Chain the patches through `get_extra_aoti_compile_context_manager()`
  so they only apply during `aot_compile` and revert cleanly afterwards.
- Override `preprocess_multimethod` for CUDA only: between method
  compiles, walk live CUDA tensors and release their storage so stale
  Inductor / Triton cache leftovers don't accumulate across
  `decode` + `prefill`.
- Report the peak GPU memory used by the export, and fail CI if it
  exceeds 20 GB.
## Verified
- Export succeeds end-to-end under a simulated 24 GB cap (0.3 of an
  80 GB A100) with ~19 GB peak GPU usage during compile.
- `qwen3_5_moe_runner` inference is coherent and matches an
  unconstrained-VRAM export (1903 tok/s prefill, 160 tok/s decode with
  `--cuda_graph=true`).
## Future upstream work
The fixes here are workarounds for three underlying Inductor / PyTorch
issues that deserve proper upstream solutions:
- An option for `aot_compile` to clone lifted buffers on CPU while still
  recording the model's target device for the runtime.
- Inductor draining its compile-time caches at the end of an
  `aot_compile` call (or exposing a public reset API).
- A `mutates_args` parameter on `wrap_triton` so kernel authors can
  declare mutations instead of having them re-derived from TTIR.
Once those land upstream, the entire workaround block in
`cuda_backend.py` can be removed.