
Add Triton INT4 dense kernels with dequant prefill path for Qwen3.5 MoE #19188

Open

digantdesai wants to merge 4 commits into gh/digantdesai/51/base from gh/digantdesai/51/head

Conversation


@digantdesai digantdesai commented Apr 28, 2026

Stack from ghstack (oldest at bottom):

Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights
(same format as MoE experts):

  • int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
  • int4_matvec: bandwidth-optimized vec-mat for M=1 decode
  • dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm
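
For reference, a minimal sketch of what the packed layout and dequant path can look like. The nibble order (low nibble = even k), the per-group scale/zero layout, and the `(q - zero) * scale` convention are illustrative assumptions, not necessarily what the kernels implement:

```python
import torch

def dequant_w4_to_bf16_ref(packed, scales, zeros, group_size=128):
    """Reference dequant for [N, K//2]-packed INT4 weights (illustrative)."""
    # packed: [N, K//2] uint8, two 4-bit values per byte
    # scales/zeros: [N, K // group_size] per-group params (assumed layout)
    N, K_half = packed.shape
    K = K_half * 2
    w = torch.empty(N, K, dtype=torch.bfloat16, device=packed.device)
    w[:, 0::2] = (packed & 0xF).to(torch.bfloat16)  # low nibble -> even k
    w[:, 1::2] = (packed >> 4).to(torch.bfloat16)   # high nibble -> odd k
    g = torch.arange(K, device=packed.device) // group_size
    return (w - zeros[:, g]) * scales[:, g]
```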

W4DequantLinear wraps these with dual decode/prefill dispatch:

  • Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
  • Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
    +67% over tinygemm baseline)
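
A rough sketch of that dispatch, reusing the reference dequant above in place of the actual Triton entry points (the class internals and the `M == 1` check are illustrative, not the PR's implementation):

```python
import torch
import torch.nn.functional as F

class W4DequantLinearSketch(torch.nn.Module):
    def __init__(self, packed, scales, zeros):
        super().__init__()
        self.register_buffer("packed", packed)  # [N, K//2] uint8
        self.register_buffer("scales", scales)  # [N, K//group_size]
        self.register_buffer("zeros", zeros)

    def forward(self, x):  # x: [M, K] bf16
        w = dequant_w4_to_bf16_ref(self.packed, self.scales, self.zeros)
        if x.shape[0] == 1:
            # decode (M=1): the PR dispatches to the bandwidth-optimized
            # int4_matvec kernel here instead of materializing w; written
            # as an explicit mat-vec to mirror that shape
            return (w @ x.squeeze(0)).unsqueeze(0)
        # prefill (M>1): dequantize once and let Inductor pick the bf16 mm
        return F.linear(x, w)
```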

Single 18GB weight blob (no duplication). Decode perf regression is a
known trade-off for uniform weight format — to be revisited with a
CUDA C++ matvec kernel.

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).
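
The suite itself is in the PR; as a sketch of the shape such a check takes, here is the vectorized reference dequant compared against a naive scalar unpack at the same rtol (names and conventions carried over from the sketches above):

```python
import torch

def dequant_w4_naive(packed, scales, zeros, group_size=128):
    # Scalar unpack with the same assumed nibble/group conventions
    N, K_half = packed.shape
    w = torch.empty(N, K_half * 2, dtype=torch.bfloat16)
    for n in range(N):
        for kb in range(K_half):
            byte = int(packed[n, kb])
            for j, q in enumerate((byte & 0xF, byte >> 4)):
                k = 2 * kb + j
                g = k // group_size
                w[n, k] = (q - zeros[n, g]) * scales[n, g]
    return w

def test_dequant_matches_naive():
    torch.manual_seed(0)
    N, K, group_size = 8, 256, 128
    packed = torch.randint(0, 256, (N, K // 2), dtype=torch.uint8)
    scales = torch.rand(N, K // group_size, dtype=torch.bfloat16) * 0.1
    zeros = torch.full((N, K // group_size), 8.0, dtype=torch.bfloat16)
    got = dequant_w4_to_bf16_ref(packed, scales, zeros, group_size)
    ref = dequant_w4_naive(packed, scales, zeros, group_size)
    torch.testing.assert_close(got, ref, rtol=0.01, atol=1e-3)
```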

Co-authored-by: Claude <noreply@anthropic.com>

@digantdesai digantdesai requested a review from lucylq as a code owner April 28, 2026 15:56

pytorch-bot Bot commented Apr 28, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19188

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 8 New Failures, 1 Cancelled Job, 1 Pending, 1 Unrelated Failure

As of commit 3e518f0 with merge base cb4e5ae:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 28, 2026
@github-actions

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track of it and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…r Qwen3.5 MoE"

Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights
(same format as MoE experts):

- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
  +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). Decode perf regression is a
known trade-off for uniform weight format — to be revisited with a
CUDA C++ matvec kernel.

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreplyanthropic.com>

[ghstack-poisoned]
…r Qwen3.5 MoE"

Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights
(same format as MoE experts):

- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
  +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). Decode perf regression is a
known trade-off for uniform weight format — to be revisited with a
CUDA C++ matvec kernel.

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreplyanthropic.com>

[ghstack-poisoned]
…r Qwen3.5 MoE"

Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights
(same format as MoE experts):

- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
  +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). Decode perf regression is a
known trade-off for uniform weight format — to be revisited with a
CUDA C++ matvec kernel.

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreplyanthropic.com>

[ghstack-poisoned]
Copy link
Copy Markdown
Contributor

@Gasoonjia Gasoonjia left a comment

Please update CI to cover the new changes.

if hasattr(layer, "mlp") and hasattr(layer.mlp, "experts"):
    layer.mlp.experts.use_batched_moe = enabled
    layer.mlp.experts.moe_activation_dtype = moe_activation_dtype
    layer.mlp.experts.moe_moe_moe_activation_dtype = moe_moe_moe_activation_dtype
Contributor

why moe_moe_moe_activation_dtype ? what does the triple MoEs mean?

Contributor Author

Claude going rogue
