Add Triton INT4 dense kernels with dequant prefill path for Qwen3.5 MoE #19188
digantdesai wants to merge 4 commits into gh/digantdesai/51/base from
Add three new Triton kernels for dense W4A16 linear projections that replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights (same format as MoE experts):
- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens, +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). The decode perf regression is a known trade-off for a uniform weight format, to be revisited with a CUDA C++ matvec kernel.

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreply@anthropic.com>
[ghstack-poisoned]
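To make the [N, K//2] packed layout concrete, here is a minimal pure-Python sketch of packing two 4-bit values per byte along K, dequantizing, and doing the M=1 matvec. It is not the PR's Triton code. The function names, the nibble order (even index in the low nibble), the zero point of 8, and the per-group scale layout are all assumptions made for illustration; the description above does not specify them.

```python
def pack_int4_row(values):
    """Pack K unsigned 4-bit values (0..15) into K//2 bytes.

    Assumed layout: even index -> low nibble, odd index -> high nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("K must be even")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return [values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2)]


def unpack_int4_row(packed):
    """Inverse of pack_int4_row: K//2 bytes -> K values in 0..15."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)         # low nibble holds the even index
        out.append((byte >> 4) & 0xF)  # high nibble holds the odd index
    return out


def dequant_row(packed, scales, group_size, zero_point=8):
    """Dequantize one packed row; scales[g] covers group_size consecutive K values.

    The zero point and the grouping are hypothetical choices.
    """
    q = unpack_int4_row(packed)
    return [(v - zero_point) * scales[i // group_size] for i, v in enumerate(q)]


def int4_matvec_reference(packed_weight, scales, x, group_size):
    """Reference for the M=1 case: y[n] = sum_k dequant(W)[n][k] * x[k]."""
    return [
        sum(w * xk for w, xk in zip(dequant_row(row, s, group_size), x))
        for row, s in zip(packed_weight, scales)
    ]


row = pack_int4_row([1, 2, 15, 0])
print(row)                                                    # [33, 15]
print(dequant_row(row, [0.5, 2.0], group_size=2))             # [-3.5, -3.0, 14.0, -16.0]
print(int4_matvec_reference([row], [[0.5, 2.0]], [1, 1, 1, 1], 2))  # [-8.5]
```

A reference like this is mainly useful as a ground truth when checking a fused kernel's output at a tolerance such as the rtol=0.01 mentioned above.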
Dr. CI (automatically generated, updates every 15 minutes): artifacts and rendered test results are at hud.pytorch.org/pr/pytorch/executorch/19188
As of commit 3e518f0 with merge base cb4e5ae: 8 new failures, 1 cancelled job, 1 pending, 1 unrelated failure. There is 1 currently active SEV.
CANCELLED JOB: the cancelled job should be retried.
BROKEN TRUNK: the unrelated failure was already present on the merge base. Rebase onto the `viable/strict` branch to avoid it.
Gasoonjia left a comment:
Please update CI to cover the new changes.
    if hasattr(layer, "mlp") and hasattr(layer.mlp, "experts"):
        layer.mlp.experts.use_batched_moe = enabled
        layer.mlp.experts.moe_activation_dtype = moe_activation_dtype
        layer.mlp.experts.moe_moe_moe_activation_dtype = moe_moe_moe_activation_dtype
Why moe_moe_moe_activation_dtype? What does the triple "moe" mean?
Claude going rogue
Stack from ghstack (oldest at bottom):
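Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights
(same format as MoE experts):
- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
  +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). The decode perf regression is a
known trade-off for a uniform weight format, to be revisited with a
CUDA C++ matvec kernel.

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreply@anthropic.com>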
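The decode/prefill dispatch described in this PR body can be sketched roughly as follows. This is a CPU-only pure-Python illustration, not the PR's W4DequantLinear. The class name `W4LinearSketch`, the `last_path` attribute, the low-nibble-first packing, the zero point of 8, and the single scale per output row are all hypothetical. The point it shows is that both paths read the same packed weights: the M=1 path dequantizes on the fly, while the M>1 path materializes the dense weight once and then does an ordinary matmul.

```python
ZERO_POINT = 8  # assumed symmetric offset for unsigned 4-bit storage


def unpack_row(packed):
    """K//2 bytes -> K values in 0..15 (low nibble first; an assumption)."""
    out = []
    for byte in packed:
        out.extend((byte & 0xF, byte >> 4))
    return out


def dequant(packed_weight, scales):
    """[N, K//2] packed rows -> [N, K] floats, one scale per row (assumed)."""
    return [
        [(q - ZERO_POINT) * s for q in unpack_row(row)]
        for row, s in zip(packed_weight, scales)
    ]


class W4LinearSketch:
    """Hypothetical stand-in for a dual-path W4A16 linear layer."""

    def __init__(self, packed_weight, scales):
        self.packed_weight = packed_weight  # single copy shared by both paths
        self.scales = scales
        self.last_path = None  # records which branch ran, for inspection

    def _matvec(self, x):
        # Decode path: dequantize while accumulating, no dense weight is built.
        y = []
        for row, s in zip(self.packed_weight, self.scales):
            acc = 0.0
            for q, xk in zip(unpack_row(row), x):
                acc += (q - ZERO_POINT) * s * xk
            y.append(acc)
        return y

    def _dequant_linear(self, xs):
        # Prefill path: build the dense weight once, then a plain matmul.
        w = dequant(self.packed_weight, self.scales)
        return [[sum(wk * xk for wk, xk in zip(w_row, x)) for w_row in w] for x in xs]

    def forward(self, xs):
        """xs is an [M, K] list of activation rows; returns [M, N]."""
        if len(xs) == 1:
            self.last_path = "matvec"
            return [self._matvec(xs[0])]
        self.last_path = "dequant+linear"
        return self._dequant_linear(xs)


layer = W4LinearSketch([[0x98, 0x7A]], [0.5])
print(layer.forward([[1, 2, 3, 4]]), layer.last_path)
print(layer.forward([[1, 2, 3, 4], [1, 0, 0, 0]]), layer.last_path)
```

Because both branches compute the same function, comparing their outputs on identical inputs is a cheap consistency check before benchmarking either path.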