Enable Qwen3.5-MoE AOTI export on memory-constrained GPUs#19205

Open

Gasoonjia wants to merge 3 commits into main from cuda-aoti-low-mem-export

Conversation

@Gasoonjia
Contributor

Goal

Make the Qwen3.5-35B-A3B HQQ-INT4 CUDA AOTI export viable on consumer-class
24 GB GPUs (RTX 4090 / 3090), so users without datacenter hardware
can run the model end-to-end.

Why it's needed

Out of the box, the export OOMs on anything smaller than ~40 GB because the
AOTI compile pipeline (a) clones mutated buffers onto the target device on
top of the live model, and (b) leaks multiple GB of CUDA tensors between
method compiles via Inductor / Triton internal caches.

What this PR changes

All fixes are scoped to executorch/backends/cuda/cuda_backend.py; there are
no PyTorch core patches and no impact on Metal or other AOTI backends:

  1. Monkey-patch Inductor's buffer cloning so the AOTI compile-time clones
    land on CPU instead of the target GPU, and patch the C++ wrapper's
    device codegen so constants_info_ still points at the real model
    device for the runtime.
  2. Wire those patches into the existing
    get_extra_aoti_compile_context_manager() so they only apply during
    aot_compile and revert cleanly afterwards.
  3. Override preprocess_multimethod for CUDA only: between method
    compiles, walk live CUDA tensors and release their storage so stale
    Inductor / Triton cache leftovers don't accumulate across
    decode + prefill.
  4. Add a regression guard: the Qwen3.5 MoE export script prints the peak
    GPU memory used, and CI fails if it exceeds 20 GB.

Verified

  • Export succeeds on an A100 capped to 20 GB; peak ~18 GB.
  • Export succeeds on a real RTX 4090.
  • Inference perf matches an unconstrained export within noise.
  • Without the fixes, peak shoots to ~37 GB and the new CI gate fails.

Future upstream work

The fixes here are workarounds for three underlying Inductor / PyTorch
issues that deserve proper upstream solutions:

  1. A first-class option for aot_compile to clone lifted buffers on CPU
    while still recording the model's target device for the runtime.
  2. Inductor / Triton should drain their own caches at the end of each
    aot_compile call (or expose a public reset API).
  3. wrap_triton should accept a mutates_args parameter so Triton-kernel
    mutations don't have to be re-derived from TTIR (detailed below).

Once those land upstream, the entire workaround block in
cuda_backend.py can be removed.

## Summary

Enable end-to-end ExecuTorch export of large models (e.g. Qwen3.5-35B-A3B
HQQ-INT4, ~18 GB of weights) under tight GPU memory budgets such as the
24 GB cap of consumer cards (RTX 4090 / 3090 / etc.) using the CUDA AOTI
backend.

Out of the box, calling `torch._inductor.aot_compile` on this model on a
24 GB-capped GPU OOMs in two distinct places:

1. **`_unlift_graph` clones every mutated buffer onto the model's target
   device.** After `move_to_device_pass(...)` that target is CUDA, so we
   end up with a transient ~18 GB GPU clone of the model weights on top
   of the live model — instant OOM.
2. **Inductor / Triton internal caches keep multi-GB worth of CUDA
   tensors alive between method compilations.** When ExecuTorch lowers a
   multi-method export (e.g. decode + prefill) those leftovers stack up,
   so the second method's compile starts from a half-full GPU and OOMs
   again under the 24 GB cap.

This diff works around both issues in `CudaBackend` only; there are no
changes to PyTorch core and no impact on Metal / other AOTI backends. The
constrained setup itself can be reproduced as sketched below.
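
For reference, the memory-constrained environment used throughout this diff
can be reproduced before export with a single call (fraction per the Test
Plan below; 0.3 of an 80 GB A100 ≈ 24 GB):

```python
import torch

# Cap device 0 to 30% of its total VRAM. On an 80 GB A100 this leaves
# ~24 GB visible, so allocations beyond the cap raise OOM just as they
# would on a 24 GB consumer card.
torch.cuda.set_per_process_memory_fraction(0.3, 0)
```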

## What this diff does

`backends/cuda/cuda_backend.py`:

1. **`_compile_time_cpu_clones(target_device)`** — context manager that
   wraps `torch._inductor.compile_fx.clone_preserve_strides` so the
   buffer clones produced by `_unlift_graph` land on CPU instead of the
   target device. The wrap is **frame-discriminated**
   (`sys._getframe(1).f_code.co_name == "_unlift_graph"`) so it does
   *not* affect `triton_heuristics.py:1101`, which re-imports the same
   symbol for autotune benchmark inputs that legitimately must stay on
   GPU. It also wraps `CppWrapperCpu.codegen_device` so the generated
   `constants_info_[i].device_type` still points at the real model
   target device (e.g. cuda), preventing a mixed-device runtime error
   when the constants are loaded back at inference time. Both patches
   are sketched after this list.

2. **`get_extra_aoti_compile_context_manager()`** — chains the existing
   SDPA-MATH manager with `_compile_time_cpu_clones` via an
   `ExitStack`, so both fire around the `torch._inductor.aot_compile`
   call in `AotiBackend.preprocess`.

3. **`preprocess_multimethod()`** — overrides the base implementation
   with a CUDA-specific cleanup loop that runs after every method
   compile. It walks `gc.get_objects()`, finds every live CUDA tensor,
   and calls `untyped_storage().resize_(0)` on it. This is how we
   release ~18 GB of stale Inductor / Triton cache leftovers between
   `decode` and `prefill` compiles. We need the in-place
   `resize_(0)` (rather than `del + gc.collect()`) because the cache
   still holds Python references — only forcibly emptying the storage
   reclaims the GPU memory. Other AOTI backends (Metal/MPS) inherit the
   default no-cleanup base implementation and are unaffected. The cleanup
   loop is also sketched after this list.
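
Minimal sketches of the three pieces above. The `clone_preserve_strides`
symbol, the `_unlift_graph` frame check, and the storage-release trick come
from this diff; the manager name `_sdpa_math_manager`, the attribute
`self.device`, and the helper names are illustrative assumptions, and the
companion `CppWrapperCpu.codegen_device` patch is omitted for brevity.

```python
import contextlib
import sys
from unittest import mock

import torch
import torch._inductor.compile_fx as compile_fx


@contextlib.contextmanager
def _compile_time_cpu_clones(target_device: torch.device):
    """Stage _unlift_graph's buffer clones on CPU during aot_compile."""
    orig_clone = compile_fx.clone_preserve_strides

    def cpu_clone(x, *args, **kwargs):
        # Frame discrimination: only clones requested by _unlift_graph are
        # redirected; autotuning code paths that use the same symbol keep
        # their GPU clones.
        if sys._getframe(1).f_code.co_name == "_unlift_graph":
            return orig_clone(x.to("cpu"), *args, **kwargs)
        return orig_clone(x, *args, **kwargs)

    with mock.patch.object(compile_fx, "clone_preserve_strides", cpu_clone):
        yield  # reverts cleanly on exit, even if compilation raises
```

Chaining into the existing hook could then look like this:

```python
def get_extra_aoti_compile_context_manager(self):
    @contextlib.contextmanager
    def combined():
        with contextlib.ExitStack() as stack:
            stack.enter_context(self._sdpa_math_manager())  # pre-existing
            stack.enter_context(_compile_time_cpu_clones(self.device))
            yield

    return combined()
```

And the per-method cleanup behind `preprocess_multimethod`, with the
`[CLEANUP]` logging and error handling trimmed:

```python
import gc

import torch


def _release_stale_cuda_tensors() -> int:
    """Forcibly empty every live CUDA tensor's storage; returns bytes freed."""
    freed = 0
    for obj in gc.get_objects():
        try:
            is_cuda_tensor = isinstance(obj, torch.Tensor) and obj.is_cuda
        except Exception:  # gc can surface exotic objects; skip them
            continue
        if is_cuda_tensor:
            storage = obj.untyped_storage()
            freed += storage.nbytes()
            # resize_(0) frees the device memory even though the Inductor /
            # Triton caches still hold Python references to the tensor,
            # which is why del + gc.collect() would not reclaim anything.
            storage.resize_(0)
    torch.cuda.empty_cache()
    return freed
```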

## Verified behavior

- `python -m executorch.examples.models.qwen3_5_moe.export
  --prequantized <hqq-int4-bundle> --backend cuda` succeeds end-to-end
  with `torch.cuda.set_per_process_memory_fraction(0.3, 0)` on an
  80 GB A100 (= 24 GB visible) — peak GPU usage during compile stays at
  ~19 GB.
- Both `[CLEANUP]` lines fire and report ~18.29 GB freed per method.
- `qwen3_5_moe_runner` inference produces coherent text and matches
  the perf of an unconstrained-VRAM export within measurement noise
  (1903 tok/s prefill, 160 tok/s decode on A100 with `--cuda_graph=true`,
  571-token prompt + 128 generated, GPU peak 18 GB).

## What should eventually move upstream

The three workarounds here all paper over real PyTorch issues that
deserve a proper fix in core:

1. **`_unlift_graph` cloning on the target device.** Cloning lifted
   buffers onto whatever device they happen to live on is not free —
   for large models the clone alone OOMs. Inductor should either
   stage the clone on CPU explicitly or expose an option to do so.
   Today we have to monkey-patch `clone_preserve_strides` *and* the
   wrapper's device codegen to compensate; both could be replaced by a
   first-class API such as `aot_compile(..., clone_buffers_on="cpu")`
   plus an internal "original device of constant" record so the C++
   wrapper writes the right `constants_info_`.

2. **Inductor / Triton caches leak compile-time CUDA tensors.** After
   `aot_compile` returns, the `.so` and `.ptd` are written, the result
   bytes are in our hands, and every CUDA tensor still alive is by
   definition stale. Today we have to walk `gc.get_objects()` and
   manually release each storage. Inductor should drain its own caches
   (`PyCodeCache`, `CompiledFxGraph`, `CachingAutotuner`, …) at the end
   of an `aot_compile` call, or at least expose a
   `torch._inductor.reset_compile_caches()` helper (a possible shape is
   sketched after this list).

3. **`wrap_triton` has no `mutates_args` parameter.** This is the
   underlying reason `identify_mutated_tensors` exists at all: the
   inner HOP for a Triton kernel call has to re-derive mutations from
   TTIR because the user can't declare them. A future
   `wrap_triton(kernel, mutates_args={"C"})` would let kernel authors
   short-circuit the TTIR analysis (and avoid the historical fallback
   that marks every input as mutated, which still shows up in older
   PyTorch / Triton combinations).
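
A possible shape for the item-2 helper, sketched as a hypothetical API
(`reset_compile_caches` does not exist today; `PyCodeCache.cache_clear()`
and `torch._dynamo.reset()` do):

```python
import torch
import torch._dynamo
import torch._inductor.codecache


def reset_compile_caches() -> None:
    # Hypothetical: what torch._inductor.reset_compile_caches() might do.
    # PyCodeCache.cache_clear() is a real Inductor entry point; the
    # CompiledFxGraph / CachingAutotuner state named above would need
    # their own public reset hooks.
    torch._inductor.codecache.PyCodeCache.cache_clear()
    torch._dynamo.reset()  # also drops cached compiled-graph state
    torch.cuda.empty_cache()  # return freed blocks to the driver
```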

Once those land, the entire monkey-patch block in this file can be
deleted.

## Test Plan

- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
  --prequantized ... --backend cuda` with
  `torch.cuda.set_per_process_memory_fraction(0.3, 0)` (24 GB cap on an
  80 GB A100). Export succeeded; both `[CLEANUP]` lines fired; peak GPU
  usage stayed under 24 GB.
- Manual: ran `qwen3_5_moe_runner` against the exported `.pte` /
  `.ptd`. Inference produced coherent output, prefill 1903 tok/s,
  decode 160 tok/s with `--cuda_graph=true`, GPU peak 18 GB.
- Unaffected backends: Metal / other AOTI backends inherit the default
  `BackendDetails.preprocess_multimethod` (no cleanup) and are not
  touched by this diff.

## Summary

Add a GPU memory regression guard so that the Qwen3.5 MoE export keeps
fitting on consumer-grade 24 GB GPUs (RTX 4090 / 3090 / A5000 …).

## What this diff does

1. `examples/models/qwen3_5_moe/export.py`
   - Reset CUDA peak memory stats at the start of the CUDA backend setup.
   - At the end of `main()`, when running with `--backend cuda`, print a
     stable, machine-parseable marker line:
       `EXPORT_GPU_PEAK_MEMORY_MB: <peak_in_MB>`
     This makes the actual peak GPU memory consumed by the entire
     load + quantize + lower pipeline visible to both humans and CI.

2. `.ci/scripts/export_model_artifact.sh` (qwen3_5_moe path)
   - Tee the export output to a temp log.
   - Grep the `EXPORT_GPU_PEAK_MEMORY_MB` marker and compare against
     `EXPORT_GPU_PEAK_MB_LIMIT` (default 20480 MB = 20 GB; overridable
     via env var).
   - Fail the job with an explanatory error if the budget is exceeded,
      so any future regression that reintroduces the ~18 GB unnecessary
      GPU clone (or a comparable leak) is caught at PR time rather than
      silently breaking 24 GB-class GPUs. Both the marker emission and
      the gate logic are sketched after this list.
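
A sketch of the marker emission in `export.py`; the exact placement and
the choice of `max_memory_allocated` as the peak statistic are assumptions:

```python
import torch

torch.cuda.reset_peak_memory_stats()  # at the start of CUDA backend setup

# ... load + quantize + lower pipeline runs here ...

peak_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
print(f"EXPORT_GPU_PEAK_MEMORY_MB: {peak_mb:.0f}")
```

The gate itself lives in the shell script; its logic, rendered here in
Python for illustration (assuming an integer-MB marker value):

```python
import os
import re
import sys

LIMIT_MB = int(os.environ.get("EXPORT_GPU_PEAK_MB_LIMIT", "20480"))


def check_export_log(log_text: str) -> None:
    m = re.search(r"EXPORT_GPU_PEAK_MEMORY_MB:\s*(\d+)", log_text)
    if m is None:
        sys.exit("export log is missing the EXPORT_GPU_PEAK_MEMORY_MB marker")
    peak_mb = int(m.group(1))
    if peak_mb > LIMIT_MB:
        sys.exit(f"GPU peak {peak_mb} MB exceeds the {LIMIT_MB} MB budget")
```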

## Notes

- Current measured peak with the CUDA backend memory fixes (see prior
  commit on this branch) is ~18 GB, leaving ~2 GB headroom under the
  20 GB limit. Without those fixes the peak shoots to ~37 GB and CI
  will fail loudly.
- The threshold is intentionally tighter than the 24 GB physical cap
  to leave room for measurement noise and small allocator overhead.

## Test Plan

- Manual: ran `python -m executorch.examples.models.qwen3_5_moe.export
  --prequantized <hqq-int4-bundle> --backend cuda` and confirmed the
  marker line is printed at the end with a sensible value (~18 GB).
- Manual: simulated CI gate logic locally with the marker line and
  confirmed both the success path and the failure path (forced
  threshold below the actual peak) behave as expected.
Gasoonjia requested a review from lucylq as a code owner on April 29, 2026 at 08:04

pytorch-bot Bot commented Apr 29, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19205

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 18 New Failures, 5 Cancelled Jobs, 8 Unrelated Failures

As of commit ef8aeb2 with merge base 8bda9ac:


👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label on Apr 29, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Gasoonjia requested review from digantdesai and mergennachin and removed the request for lucylq on April 29, 2026 at 20:32

Labels

ciflow/cuda, ciflow/metal, ciflow/mlx, CLA Signed
