extend modelbuilder to build Olmo3, SmolLM3 and other models #2078
Conversation
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Pull request overview
This PR expands the Python modelbuilder to cover additional HF architectures (e.g., OLMo3, SmolLM3, Qwen3/Qwen3.5, Nemotron-H, Gemma3) and adds a comprehensive “fast” test suite that builds tiny random-weight models offline and validates ONNX vs PyTorch (and/or validates export output). It also adds a compatibility shim to post-process genai_config.json for transformers>=5 generation defaults.
Changes:
- Add many new offline “fast” unit tests that export minimal random-weight models and compare ORT outputs against PyTorch (prefill/decode and some greedy generation).
- Extend/fix multiple builder implementations (OLMo*, SmolLM, Nemotron/Nemotron-H, Gemma3, Ernie, GPT-OSS, etc.) to match architecture quirks and transformers v4/v5 config differences.
- Add a `fix_genai_config` utility + tests, relax the pytest `--test_models` option requirement, and introduce a GitHub Actions workflow for fast tests.
Reviewed changes
Copilot reviewed 52 out of 53 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/python/models/fast/test_random_whisper.py | Adds offline Whisper encoder/decoder export + ORT vs PyTorch discrepancy checks. |
| test/python/models/fast/test_random_smollm3.py | Adds SmolLM3 random-weight discrepancy tests and greedy generation checks. |
| test/python/models/fast/test_random_qwen3_vl.py | Adds Qwen3-VL text-path discrepancy tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_qwen3_5.py | Adds Qwen3.5 full-attention execution test + hybrid build validation (transformers>=5). |
| test/python/models/fast/test_random_qwen3_0_6b.py | Adds Qwen3 random-weight discrepancy and greedy generation tests. |
| test/python/models/fast/test_random_qwen2_5_vl.py | Adds Qwen2.5-VL text-path discrepancy tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_phi4mm.py | Adds Phi4MM synthetic-weight build + ORT vs PyTorch comparisons (uses PEFT LoRA). |
| test/python/models/fast/test_random_phi3.py | Adds Phi-3 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_phi.py | Adds Phi-2 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_olmo3.py | Adds OLMo3 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_olmo2.py | Adds OLMo2 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_olmo.py | Adds OLMo (v1) random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_nemotron.py | Adds Nemotron random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_nemotron_h.py | Adds Nemotron-H random-weight discrepancy + greedy generation tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_granite.py | Adds Granite random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_gemma3.py | Adds Gemma3 random-weight discrepancy + greedy generation tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_gemma.py | Adds Gemma random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_onnx_generate.py | Adds unit tests for ONNX greedy generation helper (no-cache and KV-cache paths). |
| test/python/models/fast/test_llama_attention_discrepancies.py | Adds attention-only Llama export + numeric discrepancy tests. |
| test/python/models/fast/test_fix_genai_config.py | Adds tests validating fix_genai_config behavior for null generation defaults. |
| test/python/models/fast/test_check_extra_options.py | Adds tests for extra_options parsing/validation behavior. |
| test/python/conftest.py | Makes --test_models optional for pytest runs. |
| src/python/py/models/genai_config_utils.py | Introduces fix_genai_config to fill null search defaults for transformers>=5. |
| src/python/py/models/builders/smollm.py | Fixes per-layer RoPE enable/disable handling for SmolLM. |
| src/python/py/models/builders/phi.py | Minor formatting tweak around position_ids reformatting naming. |
| src/python/py/models/builders/olmo.py | Adds OLMo2/OLMo3 builders + OLMo v1 LayerNorm handling + OLMo2/3 q/k norm + post-norm residual flow. |
| src/python/py/models/builders/nemotron.py | Adds Nemotron-H builder and fixes Nemotron epsilon handling. |
| src/python/py/models/builders/internlm.py | Formatting/quoting cleanup (no apparent logic change in diff shown). |
| src/python/py/models/builders/gptoss.py | Fixes RoPE cache inv_freq math and tightens some style/robustness in bias combining. |
| src/python/py/models/builders/gemma.py | Improves Gemma3 RoPE parameter handling (v4/v5) and adds conditional-generation load_weights path. |
| src/python/py/models/builders/ernie.py | Patches rope_theta extraction from rope_parameters for Ernie4.5 configs. |
| src/python/py/models/builders/\_\_init\_\_.py | Re-exports fix_genai_config and narrows `__all__`. |
| .gitignore | Ignores dump_models/ and stats/ produced by fast tests. |
| .github/workflows/modelbuilder_fast_tests.yml | Adds a GitHub Actions workflow to run fast modelbuilder tests across transformers versions. |
…n overriding method' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
… in attention test Agent-Logs-Url: https://github.com/microsoft/onnxruntime-genai/sessions/76ce540d-7d1c-44e8-9035-a1230aeaaddd Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
```python
model = AutoModelForCausalLM.from_pretrained(model_id, ignore_mismatched_sizes=True, dtype=dtype)
model.eval().to(provider).to(dtype)
```
AutoModelForCausalLM.from_pretrained(...) does not accept a dtype= kwarg in the Transformers API used elsewhere in this repo (it uses torch_dtype=). As written this will raise a TypeError and the long test can't run.
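A version-agnostic sketch of picking the right keyword (assuming only the `torch_dtype` → `dtype` rename between transformers major versions; the helper name is hypothetical):

```python
def dtype_kwarg(transformers_version: str, dtype) -> dict:
    """Return the dtype keyword accepted by this transformers version.

    Assumption: transformers>=5 accepts `dtype=`, while v4 only accepts
    `torch_dtype=` and raises TypeError on `dtype=`.
    """
    major = int(transformers_version.split(".", 1)[0])
    return {"dtype" if major >= 5 else "torch_dtype": dtype}
```

The returned dict can then be splatted into `from_pretrained(model_id, **dtype_kwarg(transformers.__version__, dtype))`.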
```python
precision=precision,
model_id=model_id,
experiment="forward",
provider="cuda",
test=f"test_trained_{smodel_id}_discrepancies_{precision}_{provider}",
input_type="text",
```
The discrepancy metadata hard-codes provider="cuda" even when the test is executed with provider="cpu", which will mislabel results and make debugging harder. Use the provider parameter value instead.
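A small sketch of building that metadata from the actual arguments (field names mirror the quoted snippet; the helper itself is hypothetical):

```python
def discrepancy_metadata(model_id: str, precision: str, provider: str) -> dict:
    # Thread the real `provider` argument through instead of hard-coding
    # "cuda", so CPU runs are labeled as CPU runs.
    return {
        "precision": precision,
        "model_id": model_id,
        "experiment": "forward",
        "provider": provider,
        "test": f"test_trained_{model_id}_discrepancies_{precision}_{provider}",
        "input_type": "text",
    }
```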
```python
prompt_len = torch_feed["input_ids"].shape[1]
with torch.no_grad():
    pt_output = model.generate(**torch_feed, max_new_tokens=max_new_tokens, do_sample=False, pad_token_id=tokenizer.eos_token_id)
# Keep only the newly generated tokens (exclude the prompt).
pt_tokens = pt_output[0][prompt_len:].tolist()

# ------------------------------------------------------------------
# onnxruntime-genai greedy generation
# ------------------------------------------------------------------
og_model = og.Model(os.path.dirname(onnx_path))

params = og.GeneratorParams(og_model)
params.set_search_options(do_sample=False, max_length=max_new_tokens, temperature=1.0, top_k=1)

generator = og.Generator(og_model, params)
```
onnxruntime-genai's max_length is the total sequence length (prompt + generated). This sets max_length=max_new_tokens, which will stop generation early for non-empty prompts and can cause token mismatches vs transformers.generate(max_new_tokens=...). Set max_length=prompt_len + max_new_tokens (or use a max_new_tokens option if available).
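One way to reconcile the two budgets (a sketch; it assumes onnxruntime-genai counts prompt tokens in `max_length` as described above):

```python
def genai_search_length(prompt_len: int, max_new_tokens: int) -> int:
    """transformers' `max_new_tokens` counts only generated tokens, while
    onnxruntime-genai's `max_length` is the total sequence length
    (prompt + generated), so the prompt length must be added back when
    translating between the two APIs."""
    return prompt_len + max_new_tokens
```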
| """ | ||
| This test file is ignore unless you set ``LONGTEST=1``. | ||
| It verifies a pretrained text model, discrepancies at prefill and decoding steps. |
There was a problem hiding this comment.
Spelling/grammar: "This test file is ignore" → "This test file is ignored".
```python
onnx_path = os.path.join(output_dir, "model.onnx")
self.assertExists(onnx_path)
sess = self.check_ort(onnx_path)
```
This test builds the model with execution_provider=provider (including CUDA), but uses self.check_ort(...) which forces a CPU-only InferenceSession. For CUDA subtests this won't exercise the CUDA EP (and can hide CUDA-only issues). Use _check_with_ort(onnx_path, cpu=provider == "cpu") instead.
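A provider-selection sketch of what a `_check_with_ort(..., cpu=...)` helper presumably does (the provider names are onnxruntime's registered identifiers; the helper here is illustrative):

```python
def session_providers(cpu_only: bool) -> list:
    # Keep CPU as a fallback even when requesting CUDA, matching
    # onnxruntime's usual provider-priority convention.
    if cpu_only:
        return ["CPUExecutionProvider"]
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]
```

The list would then be passed as `providers=` when constructing the `InferenceSession`, so CUDA subtests actually exercise the CUDA EP.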
```python
onnx_path = os.path.join(output_dir, "model.onnx")
self.assertExists(onnx_path)
sess = self.check_ort(onnx_path)
```
This test builds the model with execution_provider=provider (including CUDA), but uses self.check_ort(...) which forces a CPU-only InferenceSession. For CUDA subtests this won't exercise the CUDA EP. Use _check_with_ort(onnx_path, cpu=provider == "cpu") instead.
```yaml
- name: Install nightly pytorch (cpu)
  if: matrix.torch-version == 'nightly'
  run: pip install --pre --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
```
Can we use the requirements.txt files in test/python per device target instead of specifying packages here?
We can. I installed torch CPU first to avoid pip trying the GPU version, which takes a while to install.
```yaml
- name: Upload stats artifact
  if: always()
  uses: actions/upload-artifact@v4
```
Is there a way we can have a bot that adds these results as a comment on the PRs as well? Then, we don't have to download each zip file on every run.
Added in ccee173. After the artifact upload, a new Post test results as PR comment step uses actions/github-script to read all CSV files from stats/, convert them to collapsible markdown tables, and create (or update) a PR comment. Each matrix combination gets a unique HTML marker so re-runs update the existing comment instead of creating duplicates. The job also needs pull-requests: write permission, which was added.
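The actual step runs in actions/github-script (JavaScript); a Python sketch of the CSV-to-collapsible-table conversion it performs could look like this (function name and exact markdown layout are assumptions):

```python
import csv
import io

def csv_to_collapsible_md(csv_text: str, title: str) -> str:
    """Render one stats CSV as a collapsible markdown table for a PR comment."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    # <details> keeps long per-matrix tables collapsed in the comment.
    return f"<details><summary>{title}</summary>\n\n" + "\n".join(lines) + "\n\n</details>"
```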
```python
onnx_model = GPTOSSModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
elif config.architectures[0] == "GraniteForCausalLM":
    from .builders.granite import GraniteModel
```
What is the motivation for moving the imports here?
To support multiple versions of transformers (4.57, 5.5). Some models are not available in 4.57; this way, each builder imports what it needs only when it needs it.
The local imports inside each elif block were already present in the original code — this PR didn't move them. They avoid loading all model builder modules at import time; only the relevant one is loaded when a specific architecture is requested.
| "input_ids": ir.DataType.INT64, # For standard models | ||
| "attention_mask": ir.DataType.INT64, # For standard models | ||
| "position_ids": ir.DataType.INT64, # For standard models | ||
| "inputs_embeds": self.io_dtype, # For standard models where you want to remove the embedding layer from the model (note that `inputs_embeds` is written this way to match Hugging Face format) |
There was a problem hiding this comment.
Can we revert the spacing changes in this file? These spacing changes were there before from an automatic linter, and many of the spacing changes make it harder to read/compare lines.
I should be able to do that. But it would be great to have some rules I could use to normalize the style. Why not merge a PR that changes the style and only the style, then merge this one?
Restored in 0b169f8. All the attribute dicts in __init__ (input_types, input_shapes, output_names, output_types, output_shapes, mask_attrs, embed_attrs, layernorm_attrs, rope_attrs, attention_attrs, mlp_attrs, moe_attrs, lm_head_attrs) and other long-line single-expressions (window_size, intermediate_size, num_kv_heads, activation, graph, attn_softcap, kernel_block_size, lm_head_softcap) were reverted to the original column-aligned style.
```python
:param genai_config: genai_config dict (modified in-place and returned).
:return: the modified *genai_config* dict.
"""
search = genai_config.get("search", {})
```
The following try-except block is attempting to identify default values (see src/python/py/models/builders/base.py, lines 547 to 566 at 883a04c). Can we modify this inside make_genai_config instead of making a new fix_genai_config method? We can initialize GENAI_SEARCH_DEFAULTS inside make_genai_config.
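A sketch of what initializing those defaults inside make_genai_config might look like (the default values below are illustrative, not the repo's actual table):

```python
# Illustrative defaults; real values would come from the generation config.
GENAI_SEARCH_DEFAULTS = {
    "do_sample": False,
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
}

def fill_search_defaults(genai_config: dict) -> dict:
    """Replace null search values (as emitted from transformers>=5 generation
    configs) with concrete defaults; modifies the dict in place."""
    search = genai_config.setdefault("search", {})
    for key, default in GENAI_SEARCH_DEFAULTS.items():
        if search.get(key) is None:
            search[key] = default
    return genai_config
```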
```python
# HuggingFace YaRN formula:
#   final_mscale = yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
# where yarn_get_mscale(s, m) = 0.1 * m * log(s) + 1 if s > 1 else 1
def _yarn_get_mscale(s, m):
```
Can we modify the existing make_mscale methods or add a new make_* method to perform the new computations rather than nesting this here?
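For reference, the quoted formula written out directly (a straight transcription of the comment above, not the builder's actual method):

```python
import math

def yarn_get_mscale(scale: float, mscale: float) -> float:
    # 0.1 * m * log(s) + 1 if s > 1, else 1
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

def final_mscale(factor: float, mscale: float, mscale_all_dim: float) -> float:
    # Ratio of the two mscale corrections, per the HuggingFace YaRN formula.
    return yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
```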
```python
def is_gqa_supported(self) -> bool:
    valid_gqa_configurations = {
        ("cpu", ir.DataType.FLOAT),
        ("cpu", ir.DataType.FLOAT16),
```
Many FP16 CPU kernels are not implemented. This will cause a performance regression for users. Can we remove this?
Removed in 0b169f8. ("cpu", ir.DataType.FLOAT16) was removed from is_gqa_supported() with an explanatory comment. Many FP16 CPU GQA kernels are unimplemented so including this configuration would cause a performance regression.
```python
# OLMo v1 uses standard LayerNorm (with mean subtraction), not RMSNorm.
self.layernorm_attrs["simple"] = False
# OlmoLayerNorm hard-codes eps=1e-5 and OlmoConfig has no rms_norm_eps/layer_norm_eps
# attribute, so override the epsilon that the base class defaulted to 1e-6.
```
The default here does not match the comment. Can we move this parsing into the base class so it is done automatically?
```python
def make_layernorm(self, layer_id, layernorm, skip, simple, location):
    # OlmoLayerNorm has no learnable weight or bias; inject identity values so
    # the base class make_layernorm_op can create the initializer tensors it needs.
```
If there are no weights or biases to set, do we even need to create the LayerNorm ops? Could we override make_layer to not create such ops?
```python
# For non-quantized models, load the Hugging Face model
print("Loading Qwen2_5_VLForConditionalGeneration model...")
from transformers import Qwen2_5_VLForConditionalGeneration
```
Why remove the global imports and use local imports?
To handle multiple versions of transformers.
These local imports inside load_weights (e.g. from transformers import Qwen2_5_VLForConditionalGeneration) were already present in the original code — this PR didn't move them. They're kept local to avoid importing heavy VLM classes at module load time unless that specific model class is actually used.
```python
# GQA is not supported in Whisper since there is no attention mask input
return False

def is_packed_attn_supported(self):
```
Packed attention is supported in the encoder. You can download the ONNX models from the Foundry Local catalog to see what they look like.
| "hidden_states": ["batch_size", self.max_source_positions, self.hidden_size], # ['batch_size', 'num_frames / 2', 'hidden_size'] | ||
| "present_key_cross": ["batch_size", self.num_attn_heads, self.max_source_positions, self.head_size], # ['batch_size', 'num_heads', 'num_frames / 2', 'head_size'] | ||
| "present_value_cross": ["batch_size", self.num_attn_heads, self.max_source_positions, self.head_size] # ['batch_size', 'num_heads', 'num_frames / 2', 'head_size'] | ||
| "hidden_states": [ |
| gelu_1_name = f"{basename}/Gelu_1" | ||
| gelu_1_output = f"{gelu_1_name}/output_0" | ||
| self.make_node("Gelu", inputs=[f"{conv_1_name}/output_0"], outputs=[gelu_1_output], name=gelu_1_name, approximate="none") | ||
| self.make_node("Gelu", inputs=[f"{conv_1_name}/output_0"], outputs=[gelu_1_output], name=gelu_1_name) |
| gelu_2_output = f"{gelu_2_name}/output_0" | ||
| self.make_node("Gelu", inputs=[f"{conv_2_name}/output_0"], outputs=[gelu_2_output], name=gelu_2_name, approximate="none") | ||
| self.make_value(gelu_2_output, dtype=self.io_dtype, shape=["batch_size", self.hidden_size, self.max_source_positions]) | ||
| self.make_node("Gelu", inputs=[f"{conv_2_name}/output_0"], outputs=[gelu_2_output], name=gelu_2_name) |
```python
from .builders.gemma import Gemma3Model

onnx_model = Gemma3Model(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
onnx_model.model_type = "gemma3_text"
```
This should be reverted. The model type for the VLM is gemma3, not gemma3_text, because the model type is derived from the architecture name in the config. The LLM uses gemma3_text since it is text-only. This VLM has image and text support so it uses gemma3 as the model type instead (see here for reference).
Fixed in 0b169f8. Changed onnx_model.model_type = "gemma3_text" to onnx_model.model_type = "gemma3" for the Gemma3ForConditionalGeneration (VLM) case.
```python
fp32_webgpu = execution_provider == "webgpu" and extra_options.get("use_webgpu_fp32", False)
bf16_cuda = precision == "int4" and execution_provider in {"cuda", "trt-rtx"} and extra_options.get("use_cuda_bf16", False)
bf16_cuda = (
    precision == "int4" and execution_provider in {"cuda", "trt-rtx"} and extra_options.get("use_cuda_bf16", False)
```

```python
class PvVersion:
    """Simple version of packaging.version.Version."""
```
Why can't we import packaging and use it directly?
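If packaging is acceptable as a dependency, `packaging.version.Version` handles this directly; otherwise, a stdlib-only sketch that covers plain numeric versions (pre-release tags would need packaging's full grammar):

```python
def parse_version(version: str) -> tuple:
    """Tuple-compare-friendly parse of a plain numeric version string.

    Stops at the first non-numeric component, so "5.0.0.dev0" compares as
    (5, 0, 0); this is cruder than packaging.version but dependency-free.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)
```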
```python
shutil.rmtree(path)

def get_dirs(self, prefix: str, clean: bool = True) -> tuple[str]:
    output_dir = f"dump_models/{prefix}/output"
```
Can we use path utils to create the path rather than hardcoding a string (e.g. os.path.join or Path(...) instead) for cross-platform compatibility?
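A portable version of that path construction, as the comment suggests (sketch):

```python
import os

def get_output_dir(prefix: str) -> str:
    # os.path.join uses the platform separator instead of a hard-coded "/".
    return os.path.join("dump_models", prefix, "output")
```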
```python
return lambda x: x

class ExtTestCase(unittest.TestCase):
```
What is ext an abbreviation for? Can we use the full name and rename the file accordingly?
```python
if not full.startswith(prefix):
    raise AssertionError(f"prefix={prefix!r} does not start string {full!r}.")

def check_ort(self, onx: Union["onnx.ModelProto", str]) -> "onnxruntime.InferenceSession":  # noqa: F821
```
Why test with ORT directly? Rather than test models with ORT, let's rewrite this file to test with ORT GenAI directly. That will test ORT as well.
```python
_MODEL_ID = "arnir0/Tiny-LLM"

class _AttentionOnlyLlamaModel(LlamaModel):
```
Should we create a 1-layer ONNX model and a 1-layer torch model to compare instead of attention-only? Then, we know each layer is being created correctly. In that way, we can also use ORT GenAI for the comparison instead of direct ORT.
```python
def _make_tiny_lm_no_cache(fixed_token: int = 3) -> onnx.ModelProto:
    """Returns a minimal ONNX LM with no KV cache.
```
Given we don't support models without KV cache inputs/outputs and given we can pass in a 0-filled KV cache to essentially not use a KV cache, I'm not sure how much gain we get from this test. Can we remove this?
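The comment's alternative, zero-length past KV feeds, could be sketched like this (the input names follow the usual `past_key_values.{i}.key/value` convention, an assumption here):

```python
import numpy as np

def empty_kv_cache(num_layers: int, batch: int, num_kv_heads: int, head_size: int, dtype=np.float32) -> dict:
    """Build past-KV inputs with sequence length 0 so a KV-cache model can run
    a pure prefill step, standing in for a dedicated no-cache model."""
    feeds = {}
    for i in range(num_layers):
        shape = (batch, num_kv_heads, 0, head_size)
        feeds[f"past_key_values.{i}.key"] = np.zeros(shape, dtype=dtype)
        feeds[f"past_key_values.{i}.value"] = np.zeros(shape, dtype=dtype)
    return feeds
```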
```python
# Minimal ChatGLM model source code saved to the model directory so that
# AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True) can load
# the model completely offline during fast tests.
_MODELING_CODE = """\
```
Can we instead pre-install a version of transformers inside our VM images so we don't have to save the source code? Then, we read the modeling file from the local installation. It will be tedious to have to update this modeling code each time there's a change on Hugging Face's interfaces with newer transformers versions.
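Reading the modeling file from the locally installed package, as the comment suggests, might look like this (the module-lookup helper is a sketch; actual transformers module paths vary by version):

```python
import importlib.util

def local_modeling_source(module_name: str) -> str:
    """Return the source path of an installed module, so tests can copy the
    pinned transformers modeling code instead of vendoring it as a string."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(module_name)
    return spec.origin
```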
```python
return model, config

class TestChatGLM(ExtTestCase):
```
There are a lot of modeling files for testing so I won't write the same comment on each one. But we should use ORT GenAI for testing instead of direct ORT since ORT GenAI will test both.
```python
import numpy as np
from ext_test_case import ExtTestCase, hide_stdout, requires_cuda, run_session_or_io_binding
```
All of these LLMs have the same input/output structure when using ORT GenAI. Can we consolidate all of these files into a generic test_random_llm.py file instead? That would greatly reduce the amount of code to review, test, update, etc.
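A skeleton of what a consolidated, parameterized module could look like (the case registry and tiny configs below are placeholders, not real model ids; the comparison body is elided):

```python
import unittest

# Hypothetical consolidation sketch: one parameterized module instead of one
# test file per architecture.
MODEL_CASES = [
    ("olmo3", {"num_hidden_layers": 1, "hidden_size": 16}),
    ("smollm3", {"num_hidden_layers": 1, "hidden_size": 16}),
    ("qwen3", {"num_hidden_layers": 1, "hidden_size": 16}),
]

class TestRandomLLM(unittest.TestCase):
    def test_prefill_decode_discrepancies(self):
        for name, tiny_config in MODEL_CASES:
            with self.subTest(model=name):
                # Build a tiny random-weight model, export it, then compare
                # ORT GenAI against PyTorch here (elided in this sketch).
                self.assertGreater(tiny_config["hidden_size"], 0)
```

`subTest` keeps per-architecture failures separately reported while sharing one code path for all models with the same I/O structure.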
```python
def get_path_for_model(data_path, model_name, precision, device):
    if data_path is None:
```
Can we instead move the generated torch models to be stored inside the location in test_models? The CIs are heavily dependent on this flag existing.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
… FP16 CPU GQA, add PR comment step Agent-Logs-Url: https://github.com/microsoft/onnxruntime-genai/sessions/b79ebc36-b773-4f75-9f29-9f8dd5d65330 Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
…kflow Agent-Logs-Url: https://github.com/microsoft/onnxruntime-genai/sessions/b79ebc36-b773-4f75-9f29-9f8dd5d65330 Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>