Gemma 4 builder scaffolding (Part 1 of N, #2062) #2088
Draft
jaburges wants to merge 1 commit into microsoft:main from
Conversation
Scaffolds Gemma 4 support in the Python model builder. This is intentionally
Part 1 of N: it wires the dispatcher, introduces `Gemma4TextModel`, and
implements the Gemma 4-specific builder hooks that can be handled entirely
on the builder side. Features that require C++ runtime changes (per-layer
head_size, Per-Layer Embeddings, KV cache sharing) raise a clear
`NotImplementedError` unless `config_only=true`.
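The dispatcher wiring described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual `builder.py` code: the function name, the dict-based config, and the `extra_options` handling are assumptions; only the architecture names, the `text_config` flattening, the `gemma4_text` stamp, and the `exclude_embeds` default come from the PR text.

```python
# Hedged sketch of the dispatcher: route the two Gemma 4 architectures,
# flatten text_config onto the top-level config, and stamp model_type.
GEMMA4_ARCHITECTURES = {"Gemma4ForConditionalGeneration", "Gemma4ForCausalLM"}

def dispatch(config: dict, extra_options: dict) -> dict:
    arch = config["architectures"][0]
    if arch not in GEMMA4_ARCHITECTURES:
        raise NotImplementedError(f"Unsupported architecture: {arch}")
    # Flatten text_config onto the top-level config, the way the Gemma 3 branch does.
    flat = {**config, **config.get("text_config", {})}
    flat["model_type"] = "gemma4_text"
    extra_options.setdefault("exclude_embeds", "true")
    return flat

opts: dict = {}
flat = dispatch(
    {"architectures": ["Gemma4ForCausalLM"], "text_config": {"num_hidden_layers": 35}},
    opts,
)
print(flat["model_type"], flat["num_hidden_layers"], opts["exclude_embeds"])
# → gemma4_text 35 true
```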
What this change covers
-----------------------
* `builders/gemma4.py` (new): `Gemma4TextModel(Gemma3Model)` with:
- `is_local(layer_id)` driven by `config.layer_types` instead of
Gemma 3's hard-coded every-6th rule. Gemma 4 exposes the full list
(28x sliding + 7x full on E2B).
- `make_key_value_cache_shape(layer_id, shape)` overridden to emit
`head_dim=256` for sliding layers and `global_head_dim=512` for
full-attention layers via the existing per-layer extension point.
- `make_rotary_embedding_multi_cache()` that builds both RoPE caches
(sliding: theta=10_000, partial_rotary=1.0; full: theta=1_000_000,
partial_rotary=0.25, rope_type=proportional) from
`config.rope_parameters`.
- Explicit unsupported-feature gate that raises `NotImplementedError`
with an actionable message when PLE, KV cache sharing, or variable
head dim are actually required (i.e. not in `config_only` mode).
* `builder.py` / `builders/__init__.py`: dispatch for
`Gemma4ForConditionalGeneration` and `Gemma4ForCausalLM`. Flattens
`text_config` onto the top-level config the way the Gemma 3 branch
does, sets `exclude_embeds=true`, stamps `model_type="gemma4_text"`.
* `test/python/test_gemma4_builder.py` (new): 10 unit tests for the
Gemma 4-specific overrides. Uses a stubbed `Gemma3Model` parent so the
tests run in a clean Python env without torch/transformers/onnxruntime.
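The two per-layer overrides listed above can be illustrated with a self-contained sketch. The real `Gemma4TextModel` subclasses `Gemma3Model`; here a stub config stands in so the logic runs on its own. The E2B layer pattern (full attention at indices 4, 9, ..., 34) and the 256/512 head dims come from this PR; the `[batch, kv_heads, seq, head_dim]` shape layout is an assumption for illustration.

```python
# Stub config mirroring the HF field names referenced in this PR.
class StubConfig:
    def __init__(self):
        self.num_hidden_layers = 35
        # E2B: 28x sliding + 7x full, full attention every 5th layer.
        self.layer_types = [
            "full_attention" if (i + 1) % 5 == 0 else "sliding_attention"
            for i in range(35)
        ]
        self.head_dim = 256          # sliding layers
        self.global_head_dim = 512   # full-attention layers

class Gemma4TextModelSketch:
    def __init__(self, config):
        self.config = config

    def is_local(self, layer_id: int) -> bool:
        # Driven by config.layer_types instead of Gemma 3's every-6th rule.
        return self.config.layer_types[layer_id] == "sliding_attention"

    def make_key_value_cache_shape(self, layer_id, batch, num_kv_heads, seq_len):
        # head_dim=256 on sliding layers, global_head_dim=512 on full layers.
        head_dim = (
            self.config.head_dim if self.is_local(layer_id)
            else self.config.global_head_dim
        )
        return [batch, num_kv_heads, seq_len, head_dim]

model = Gemma4TextModelSketch(StubConfig())
print(model.is_local(0), model.is_local(4))                    # → True False
print(model.make_key_value_cache_shape(4, 1, 8, 128))          # → [1, 8, 128, 512]
```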
What this change does NOT cover
-------------------------------
These are explicitly out of scope for Part 1 because each requires a
paired C++ runtime change:
1. Per-Layer Embeddings (PLE) routing. Gemma 4's `embed_tokens` emits
`per_layer_inputs` [B, S, num_hidden_layers, hidden_size_per_layer_input]
alongside `inputs_embeds`. Each decoder layer consumes its own slice.
Needs a new side-channel input both in the ONNX graph and in the C++
model loader.
2. KV cache sharing. Gemma 4 E2B has 35 layers but only 15 own unique
KV caches (`num_kv_shared_layers=20`). `kv_cache.cpp` currently
allocates one pair per layer.
3. Per-layer `head_size` in the C++ runtime. Even with this builder
emitting correct per-layer past_kv shapes, `kv_cache.cpp` allocates
using a single `model.decoder.head_size` from `genai_config.json`
for all layers. The config schema and the runtime's KV cache manager
need to grow per-layer awareness (or a `head_size` + `global_head_size`
split driven by the same `layer_types` pattern).
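For item 2, the layer aliasing the runtime would need can be sketched as a mapping from each layer to the layer whose KV cache it reads. The aliasing rule below is an assumption (modelled on how shared KV layers typically reuse the last owning layer of the same attention type); the PR only states the counts, so treat this as illustrative:

```python
def kv_owner_map(num_layers, num_kv_shared_layers, layer_types):
    """Map each layer to the layer whose KV cache it reads.

    Assumed rule: the first num_layers - num_kv_shared_layers layers own a
    cache; each later layer aliases the last owning layer with the same
    attention type. E2B: 35 layers, 20 shared -> 15 owned caches.
    """
    first_shared = num_layers - num_kv_shared_layers
    owners, last_by_type = {}, {}
    for i in range(num_layers):
        if i < first_shared:
            owners[i] = i
            last_by_type[layer_types[i]] = i
        else:
            owners[i] = last_by_type[layer_types[i]]
    return owners

layer_types = [
    "full_attention" if (i + 1) % 5 == 0 else "sliding_attention"
    for i in range(35)
]
owners = kv_owner_map(35, 20, layer_types)
print(len(set(owners.values())))  # → 15 unique caches for 35 layers
```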
The builder raises `NotImplementedError` listing exactly which of the
above gaps were tripped, so downstream consumers get an actionable
error rather than a broken graph.
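A minimal sketch of that gate: inspect the config, collect every gap actually tripped, and raise a single `NotImplementedError` naming all of them unless only config files are requested. The attribute names (`hidden_size_per_layer_input`, `num_kv_shared_layers`, `global_head_dim`) follow the HF config fields referenced in this PR; the `SimpleNamespace` config and the exact message wording are stand-ins.

```python
from types import SimpleNamespace

def check_unsupported(config, config_only: bool = False) -> list:
    gaps = []
    if getattr(config, "hidden_size_per_layer_input", 0):
        gaps.append("Per-Layer Embeddings (PLE) routing")
    if getattr(config, "num_kv_shared_layers", 0):
        gaps.append("KV cache sharing")
    if getattr(config, "global_head_dim", None) not in (
        None,
        getattr(config, "head_dim", None),
    ):
        gaps.append("per-layer head_size in the C++ runtime")
    if gaps and not config_only:
        raise NotImplementedError(
            "Gemma 4 export needs C++ runtime support for: " + ", ".join(gaps)
            + ". Re-run with --extra_options config_only=true to emit "
            "config/processing files only."
        )
    return gaps

# E2B-like config trips all three gaps; config_only bypasses the raise.
e2b = SimpleNamespace(
    hidden_size_per_layer_input=256,
    num_kv_shared_layers=20,
    head_dim=256,
    global_head_dim=512,
)
print(check_unsupported(e2b, config_only=True))
```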
Verification
------------
`python test/python/test_gemma4_builder.py -v` passes 10/10:
- is_local correctly reflects `layer_types` for every layer 0..34
- make_key_value_cache_shape emits 512 for layers 4/9/14/19/24/29/34
and 256 for every other layer
- The unsupported-feature gate raises for PLE, KV sharing, and variable
head dim in isolation
- config_only mode bypasses the gate cleanly
- Both RoPE caches are emitted with the correct theta and partial
rotary factor
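The dual RoPE cache build those tests cover can be sketched shape-wise: one cos/sin table per attention kind, differing in theta and in how much of the head dim is rotated. Pure-Python and illustrative only; the real builder emits ONNX initializers, and the exact `rope_type=proportional` scaling is not reproduced here.

```python
import math

def make_rope_cache(head_dim, theta, partial_rotary_factor, max_pos=512):
    # Only rotary_dim of the head dim gets rotated; one frequency per pair.
    rotary_dim = int(head_dim * partial_rotary_factor)
    inv_freq = [theta ** (-(2 * i) / rotary_dim) for i in range(rotary_dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_pos)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_pos)]
    return cos, sin

# Sliding layers: theta=10_000, full head_dim=256 rotated -> 128 freq pairs.
sliding_cos, _ = make_rope_cache(256, 10_000, 1.0)
# Full layers: theta=1_000_000, a quarter of head_dim=512 rotated -> 64 pairs.
full_cos, _ = make_rope_cache(512, 1_000_000, 0.25)
print(len(sliding_cos[0]), len(full_cos[0]))  # → 128 64
```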
References
----------
* Issue: microsoft#2062
* Gemma 4 E2B config:
https://huggingface.co/google/gemma-4-E2B-it/blob/main/config.json
* Community ONNX export (architectural reference only; uses an
incompatible I/O contract): https://huggingface.co/onnx-community/gemma-4-E2B-it-ONNX
Credit to @elbruno for the original architectural analysis in microsoft#2062.
Made-with: Cursor
Description
Scaffolds Gemma 4 support in the Python model builder, addressing part of #2062. Intentionally opened as a Draft because this is Part 1 of N — the builder-side work only. See "Scope" below for exactly what is and isn't covered, and why a functional end-to-end path requires paired C++ runtime changes.
Credit to @elbruno for the original architectural write-up in #2062; this PR follows that decomposition.
Motivation
Google released the Gemma 4 family (E2B / E4B / 26B-A4B / 31B) in April 2026. All variants use a new architecture with three features that don't fit GenAI's current assumptions:
* Per-Layer Embeddings: `embed_tokens` emits two tensors; each decoder layer consumes its own slice.
* Variable head dim: sliding layers use `head_dim=256`, full-attention layers use `global_head_dim=512`.
* KV cache sharing (`num_kv_shared_layers=20`).

Existing workarounds (patching `builder.py` to route through Gemma 3, or using the `onnx-community/gemma-4-E2B-it-ONNX` export) fail with shape errors or I/O-contract mismatches. See #2062 for the full analysis.

Scope
✅ Covered by this PR
* `builders/gemma4.py` (new): `Gemma4TextModel(Gemma3Model)` overriding:
  - `is_local(layer_id)` driven by `config.layer_types` (Gemma 3 used a hard-coded every-6th-layer rule; Gemma 4 exposes the full list).
  - `make_key_value_cache_shape(layer_id, shape)` using the existing per-layer extension point to emit `head_dim=256` on sliding layers and `global_head_dim=512` on full-attention layers.
  - `make_rotary_embedding_multi_cache()` building the two RoPE caches Gemma 4 needs (sliding: `theta=10_000`, `partial_rotary_factor=1.0`; full: `theta=1_000_000`, `partial_rotary_factor=0.25`, `rope_type=proportional`) from `config.rope_parameters`.
* `builder.py`: dispatch for `Gemma4ForConditionalGeneration` and `Gemma4ForCausalLM`, with `text_config` flattening and `exclude_embeds=true` defaulting, matching the existing Gemma 3 pattern.
* `builders/__init__.py`: exports `Gemma4TextModel`.
* `test/python/test_gemma4_builder.py` (new): 10 unit tests for the Gemma 4-specific overrides. Runs in a clean Python env — no torch / transformers / onnxruntime needed — by stubbing the `Gemma3Model` parent.

❌ Not covered — requires C++ runtime changes (Parts 2/3/4)
These are explicitly out of scope. The builder raises
`NotImplementedError` listing exactly which ones were tripped, so users get an actionable error rather than a broken graph:

1. Per-Layer Embeddings: needs a new `per_layer_inputs` side-channel input both in the ONNX graph (new `make_inputs_and_outputs` branch) and in the C++ model loader. Touches `src/models/model.cc` and friends.
2. KV cache sharing: `src/models/kv_cache.cpp` currently allocates one past/present pair per layer. Gemma 4 needs layer aliasing (20 of 35 layers reuse an earlier layer's KV).
3. Per-layer `head_size` in the runtime: even with this builder emitting correct per-layer past_kv shapes, `kv_cache.cpp:17` reads a single `model_.config_->model.decoder.head_size` and applies it uniformly. The `genai_config.json` schema and the KV cache manager need per-layer awareness, or a `head_size` + `global_head_size` split driven by the same `layer_types` pattern the builder already understands.

Until those land, this builder supports
`--extra_options config_only=true` only. That's still useful for validating the dispatcher and for downstream tooling that only needs `genai_config.json` / processing files. The gate makes that explicit rather than silently emitting a broken graph.

Verification
The tests pin down:
* `is_local` matches the expected E2B pattern — full layers at exactly 4, 9, 14, 19, 24, 29, 34.
* `make_key_value_cache_shape` emits 512 on those seven layers and 256 everywhere else.
* `config_only=true` mode bypasses the gate cleanly.

Request for maintainer review
Specific things I'd welcome feedback on before taking Parts 2/3 further:

1. Per-layer `head_size`: do you prefer (a) a `head_size` list in `genai_config.json`, (b) a `head_size` + `global_head_size` pair plus a `layer_types`-like pattern, or (c) keeping both values in the model metadata only and teaching `kv_cache.cpp` to read them from the ONNX graph? I have a mild preference for (b) for symmetry with how `is_local` already works.
2. RoPE cache at `head_dim`: once per-layer head_size support lands in the runtime, the builder should emit the full-attention RoPE cache at `global_head_dim` — open to suggestions on the cleanest way to thread that through `make_rotary_embedding_caches` without duplicating logic.

References
* `google/gemma-4-E2B-it/config.json`
* `onnx-community/gemma-4-E2B-it-ONNX`