extend modelbuilder to build Olmo3, SmolLM3 and other models #2078
Conversation
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Pull request overview
This PR expands the Python modelbuilder to cover additional HF architectures (e.g., OLMo3, SmolLM3, Qwen3/Qwen3.5, Nemotron-H, Gemma3) and adds a comprehensive “fast” test suite that builds tiny random-weight models offline and validates ONNX vs PyTorch (and/or validates export output). It also adds a compatibility shim to post-process genai_config.json for transformers>=5 generation defaults.
Changes:
- Add many new offline “fast” unit tests that export minimal random-weight models and compare ORT outputs against PyTorch (prefill/decode and some greedy generation).
- Extend/fix multiple builder implementations (OLMo*, SmolLM, Nemotron/Nemotron-H, Gemma3, Ernie, GPT-OSS, etc.) to match architecture quirks and transformers v4/v5 config differences.
- Add a `fix_genai_config` utility + tests, relax the pytest `--test_models` option requirement, and introduce a GitHub Actions workflow for fast tests.
Reviewed changes
Copilot reviewed 52 out of 53 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/python/models/fast/test_random_whisper.py | Adds offline Whisper encoder/decoder export + ORT vs PyTorch discrepancy checks. |
| test/python/models/fast/test_random_smollm3.py | Adds SmolLM3 random-weight discrepancy tests and greedy generation checks. |
| test/python/models/fast/test_random_qwen3_vl.py | Adds Qwen3-VL text-path discrepancy tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_qwen3_5.py | Adds Qwen3.5 full-attention execution test + hybrid build validation (transformers>=5). |
| test/python/models/fast/test_random_qwen3_0_6b.py | Adds Qwen3 random-weight discrepancy and greedy generation tests. |
| test/python/models/fast/test_random_qwen2_5_vl.py | Adds Qwen2.5-VL text-path discrepancy tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_phi4mm.py | Adds Phi4MM synthetic-weight build + ORT vs PyTorch comparisons (uses PEFT LoRA). |
| test/python/models/fast/test_random_phi3.py | Adds Phi-3 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_phi.py | Adds Phi-2 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_olmo3.py | Adds OLMo3 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_olmo2.py | Adds OLMo2 random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_olmo.py | Adds OLMo (v1) random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_nemotron.py | Adds Nemotron random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_nemotron_h.py | Adds Nemotron-H random-weight discrepancy + greedy generation tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_granite.py | Adds Granite random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_random_gemma3.py | Adds Gemma3 random-weight discrepancy + greedy generation tests (transformers>=5 guarded). |
| test/python/models/fast/test_random_gemma.py | Adds Gemma random-weight discrepancy + greedy generation tests. |
| test/python/models/fast/test_onnx_generate.py | Adds unit tests for ONNX greedy generation helper (no-cache and KV-cache paths). |
| test/python/models/fast/test_llama_attention_discrepancies.py | Adds attention-only Llama export + numeric discrepancy tests. |
| test/python/models/fast/test_fix_genai_config.py | Adds tests validating fix_genai_config behavior for null generation defaults. |
| test/python/models/fast/test_check_extra_options.py | Adds tests for extra_options parsing/validation behavior. |
| test/python/conftest.py | Makes --test_models optional for pytest runs. |
| src/python/py/models/genai_config_utils.py | Introduces fix_genai_config to fill null search defaults for transformers>=5. |
| src/python/py/models/builders/smollm.py | Fixes per-layer RoPE enable/disable handling for SmolLM. |
| src/python/py/models/builders/phi.py | Minor formatting tweak around position_ids reformatting naming. |
| src/python/py/models/builders/olmo.py | Adds OLMo2/OLMo3 builders + OLMo v1 LayerNorm handling + OLMo2/3 q/k norm + post-norm residual flow. |
| src/python/py/models/builders/nemotron.py | Adds Nemotron-H builder and fixes Nemotron epsilon handling. |
| src/python/py/models/builders/internlm.py | Formatting/quoting cleanup (no apparent logic change in diff shown). |
| src/python/py/models/builders/gptoss.py | Fixes RoPE cache inv_freq math and tightens some style/robustness in bias combining. |
| src/python/py/models/builders/gemma.py | Improves Gemma3 RoPE parameter handling (v4/v5) and adds conditional-generation load_weights path. |
| src/python/py/models/builders/ernie.py | Patches rope_theta extraction from rope_parameters for Ernie4.5 configs. |
| src/python/py/models/builders/\_\_init\_\_.py | Re-exports fix_genai_config and narrows `__all__`. |
| .gitignore | Ignores dump_models/ and stats/ produced by fast tests. |
| .github/workflows/modelbuilder_fast_tests.yml | Adds a GitHub Actions workflow to run fast modelbuilder tests across transformers versions. |
…n overriding method' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
… in attention test Agent-Logs-Url: https://github.com/microsoft/onnxruntime-genai/sessions/76ce540d-7d1c-44e8-9035-a1230aeaaddd Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
```python
model = AutoModelForCausalLM.from_pretrained(model_id, ignore_mismatched_sizes=True, dtype=dtype)
model.eval().to(provider).to(dtype)
```
AutoModelForCausalLM.from_pretrained(...) does not accept a dtype= kwarg in the Transformers API used elsewhere in this repo (it uses torch_dtype=). As written this will raise a TypeError and the long test can't run.
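A version-agnostic sketch of picking the right keyword (assuming only the `torch_dtype` → `dtype` rename between transformers major versions; the helper name is hypothetical):

```python
def dtype_kwarg(transformers_version: str, dtype) -> dict:
    """Return the dtype keyword accepted by this transformers version.

    Assumption: transformers>=5 accepts `dtype=`, while v4 only accepts
    `torch_dtype=` and raises TypeError on `dtype=`.
    """
    major = int(transformers_version.split(".", 1)[0])
    return {"dtype" if major >= 5 else "torch_dtype": dtype}
```

The returned dict can then be splatted into `from_pretrained(model_id, **dtype_kwarg(transformers.__version__, dtype))`.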
```python
precision=precision,
model_id=model_id,
experiment="forward",
provider="cuda",
test=f"test_trained_{smodel_id}_discrepancies_{precision}_{provider}",
input_type="text",
```
The discrepancy metadata hard-codes provider="cuda" even when the test is executed with provider="cpu", which will mislabel results and make debugging harder. Use the provider parameter value instead.
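A small sketch of building that metadata from the actual arguments (field names mirror the quoted snippet; the helper itself is hypothetical):

```python
def discrepancy_metadata(model_id: str, precision: str, provider: str) -> dict:
    # Thread the real `provider` argument through instead of hard-coding
    # "cuda", so CPU runs are labeled as CPU runs.
    return {
        "precision": precision,
        "model_id": model_id,
        "experiment": "forward",
        "provider": provider,
        "test": f"test_trained_{model_id}_discrepancies_{precision}_{provider}",
        "input_type": "text",
    }
```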
```python
prompt_len = torch_feed["input_ids"].shape[1]
with torch.no_grad():
    pt_output = model.generate(**torch_feed, max_new_tokens=max_new_tokens, do_sample=False, pad_token_id=tokenizer.eos_token_id)
# Keep only the newly generated tokens (exclude the prompt).
pt_tokens = pt_output[0][prompt_len:].tolist()

# ------------------------------------------------------------------
# onnxruntime-genai greedy generation
# ------------------------------------------------------------------
og_model = og.Model(os.path.dirname(onnx_path))

params = og.GeneratorParams(og_model)
params.set_search_options(do_sample=False, max_length=max_new_tokens, temperature=1.0, top_k=1)

generator = og.Generator(og_model, params)
```
onnxruntime-genai's max_length is the total sequence length (prompt + generated). This sets max_length=max_new_tokens, which will stop generation early for non-empty prompts and can cause token mismatches vs transformers.generate(max_new_tokens=...). Set max_length=prompt_len + max_new_tokens (or use a max_new_tokens option if available).
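One way to reconcile the two budgets (a sketch; it assumes onnxruntime-genai counts prompt tokens in `max_length` as described above):

```python
def genai_search_length(prompt_len: int, max_new_tokens: int) -> int:
    """transformers' `max_new_tokens` counts only generated tokens, while
    onnxruntime-genai's `max_length` is the total sequence length
    (prompt + generated), so the prompt length must be added back when
    translating between the two APIs."""
    return prompt_len + max_new_tokens
```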
| """ | ||
| This test file is ignore unless you set ``LONGTEST=1``. | ||
| It verifies a pretrained text model, discrepancies at prefill and decoding steps. |
There was a problem hiding this comment.
Spelling/grammar: "This test file is ignore" → "This test file is ignored".
```python
onnx_path = os.path.join(output_dir, "model.onnx")
self.assertExists(onnx_path)
sess = self.check_ort(onnx_path)
```
This test builds the model with execution_provider=provider (including CUDA), but uses self.check_ort(...) which forces a CPU-only InferenceSession. For CUDA subtests this won't exercise the CUDA EP (and can hide CUDA-only issues). Use _check_with_ort(onnx_path, cpu=provider == "cpu") instead.
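A provider-selection sketch of what a `_check_with_ort(..., cpu=...)` helper presumably does (the provider names are onnxruntime's registered identifiers; the helper here is illustrative):

```python
def session_providers(cpu_only: bool) -> list:
    # Keep CPU as a fallback even when requesting CUDA, matching
    # onnxruntime's usual provider-priority convention.
    if cpu_only:
        return ["CPUExecutionProvider"]
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]
```

The list would then be passed as `providers=` when constructing the `InferenceSession`, so CUDA subtests actually exercise the CUDA EP.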
```python
onnx_path = os.path.join(output_dir, "model.onnx")
self.assertExists(onnx_path)
sess = self.check_ort(onnx_path)
```
This test builds the model with execution_provider=provider (including CUDA), but uses self.check_ort(...) which forces a CPU-only InferenceSession. For CUDA subtests this won't exercise the CUDA EP. Use _check_with_ort(onnx_path, cpu=provider == "cpu") instead.
```yaml
- name: Install nightly pytorch (cpu)
  if: matrix.torch-version == 'nightly'
  run: pip install --pre --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
```
Can we use the requirements.txt files in test/python per device target instead of specifying packages here?
We can. I installed torch CPU first to avoid pip trying the GPU version, which takes a while to install.
```yaml
- name: Upload stats artifact
  if: always()
  uses: actions/upload-artifact@v4
```
Is there a way we can have a bot that adds these results as a comment on the PRs as well? Then, we don't have to download each zip file on every run.
Added in ccee173. After the artifact upload, a new Post test results as PR comment step uses actions/github-script to read all CSV files from stats/, convert them to collapsible markdown tables, and create (or update) a PR comment. Each matrix combination gets a unique HTML marker so re-runs update the existing comment instead of creating duplicates. The job also needs pull-requests: write permission, which was added.
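The actual step runs in actions/github-script (JavaScript); a Python sketch of the CSV-to-collapsible-table conversion it performs could look like this (function name and exact markdown layout are assumptions):

```python
import csv
import io

def csv_to_collapsible_md(csv_text: str, title: str) -> str:
    """Render one stats CSV as a collapsible markdown table for a PR comment."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    # <details> keeps long per-matrix tables collapsed in the comment.
    return f"<details><summary>{title}</summary>\n\n" + "\n".join(lines) + "\n\n</details>"
```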
```python
onnx_model = GPTOSSModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
elif config.architectures[0] == "GraniteForCausalLM":
    from .builders.granite import GraniteModel
```
What is the motivation for moving the imports here?
To support multiple versions of transformers (4.57, 5.5). Some models are not available in 4.57; this way, each builder imports what it needs only when it needs it.
The local imports inside each elif block were already present in the original code — this PR didn't move them. They avoid loading all model builder modules at import time; only the relevant one is loaded when a specific architecture is requested.
| "input_ids": ir.DataType.INT64, # For standard models | ||
| "attention_mask": ir.DataType.INT64, # For standard models | ||
| "position_ids": ir.DataType.INT64, # For standard models | ||
| "inputs_embeds": self.io_dtype, # For standard models where you want to remove the embedding layer from the model (note that `inputs_embeds` is written this way to match Hugging Face format) |
There was a problem hiding this comment.
Can we revert the spacing changes in this file? These spacing changes were there before from an automatic linter, and many of the spacing changes make it harder to read/compare lines.
I should be able to do that. But it would be great to have some rules I could use to normalize the style. Why not merge a PR that changes the style and only the style, then merge this one?
Restored in 0b169f8. All the attribute dicts in __init__ (input_types, input_shapes, output_names, output_types, output_shapes, mask_attrs, embed_attrs, layernorm_attrs, rope_attrs, attention_attrs, mlp_attrs, moe_attrs, lm_head_attrs) and other long-line single-expressions (window_size, intermediate_size, num_kv_heads, activation, graph, attn_softcap, kernel_block_size, lm_head_softcap) were reverted to the original column-aligned style.
```python
:param genai_config: genai_config dict (modified in-place and returned).
:return: the modified *genai_config* dict.
"""
search = genai_config.get("search", {})
```
The following try-except block is attempting to identify default values (see src/python/py/models/builders/base.py, lines 547 to 566 at 883a04c). Can we modify this inside make_genai_config instead of making a new fix_genai_config method? We can initialize GENAI_SEARCH_DEFAULTS inside make_genai_config.
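A sketch of what initializing those defaults inside make_genai_config might look like (the default values below are illustrative, not the repo's actual table):

```python
# Illustrative defaults; real values would come from the generation config.
GENAI_SEARCH_DEFAULTS = {
    "do_sample": False,
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
}

def fill_search_defaults(genai_config: dict) -> dict:
    """Replace null search values (as emitted from transformers>=5 generation
    configs) with concrete defaults; modifies the dict in place."""
    search = genai_config.setdefault("search", {})
    for key, default in GENAI_SEARCH_DEFAULTS.items():
        if search.get(key) is None:
            search[key] = default
    return genai_config
```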
```python
# HuggingFace YaRN formula:
#   final_mscale = yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
# where yarn_get_mscale(s, m) = 0.1 * m * log(s) + 1 if s > 1 else 1
def _yarn_get_mscale(s, m):
```
Can we modify the existing make_mscale methods or add a new make_* method to perform the new computations rather than nesting this here?
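For reference, the quoted formula written out directly (a straight transcription of the comment above, not the builder's actual method):

```python
import math

def yarn_get_mscale(scale: float, mscale: float) -> float:
    # 0.1 * m * log(s) + 1 if s > 1, else 1
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

def final_mscale(factor: float, mscale: float, mscale_all_dim: float) -> float:
    # Ratio of the two mscale corrections, per the HuggingFace YaRN formula.
    return yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
```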
```python
def is_gqa_supported(self) -> bool:
    valid_gqa_configurations = {
        ("cpu", ir.DataType.FLOAT),
        ("cpu", ir.DataType.FLOAT16),
```
Many FP16 CPU kernels are not implemented. This will cause a performance regression for users. Can we remove this?
Removed in 0b169f8. ("cpu", ir.DataType.FLOAT16) was removed from is_gqa_supported() with an explanatory comment. Many FP16 CPU GQA kernels are unimplemented so including this configuration would cause a performance regression.
```python
# OLMo v1 uses standard LayerNorm (with mean subtraction), not RMSNorm.
self.layernorm_attrs["simple"] = False
# OlmoLayerNorm hard-codes eps=1e-5 and OlmoConfig has no rms_norm_eps/layer_norm_eps
# attribute, so override the epsilon that the base class defaulted to 1e-6.
```
The default here does not match the comment. Can we move this parsing into the base class so it is done automatically?
```python
def make_layernorm(self, layer_id, layernorm, skip, simple, location):
    # OlmoLayerNorm has no learnable weight or bias; inject identity values so
    # the base class make_layernorm_op can create the initializer tensors it needs.
```
If there are no weights or biases to set, do we even need to create the LayerNorm ops? Could we override make_layer to not create such ops?
```python
# For non-quantized models, load the Hugging Face model
print("Loading Qwen2_5_VLForConditionalGeneration model...")
from transformers import Qwen2_5_VLForConditionalGeneration
```
Why remove the global imports and use local imports?
To handle multiple versions of transformers.
These local imports inside load_weights (e.g. from transformers import Qwen2_5_VLForConditionalGeneration) were already present in the original code — this PR didn't move them. They're kept local to avoid importing heavy VLM classes at module load time unless that specific model class is actually used.
```python
# GQA is not supported in Whisper since there is no attention mask input
return False

def is_packed_attn_supported(self):
```
Packed attention is supported in the encoder. You can download the ONNX models from the Foundry Local catalog to see what they look like.
| "hidden_states": ["batch_size", self.max_source_positions, self.hidden_size], # ['batch_size', 'num_frames / 2', 'hidden_size'] | ||
| "present_key_cross": ["batch_size", self.num_attn_heads, self.max_source_positions, self.head_size], # ['batch_size', 'num_heads', 'num_frames / 2', 'head_size'] | ||
| "present_value_cross": ["batch_size", self.num_attn_heads, self.max_source_positions, self.head_size] # ['batch_size', 'num_heads', 'num_frames / 2', 'head_size'] | ||
| "hidden_states": [ |
| gelu_1_name = f"{basename}/Gelu_1" | ||
| gelu_1_output = f"{gelu_1_name}/output_0" | ||
| self.make_node("Gelu", inputs=[f"{conv_1_name}/output_0"], outputs=[gelu_1_output], name=gelu_1_name, approximate="none") | ||
| self.make_node("Gelu", inputs=[f"{conv_1_name}/output_0"], outputs=[gelu_1_output], name=gelu_1_name) |
| gelu_2_output = f"{gelu_2_name}/output_0" | ||
| self.make_node("Gelu", inputs=[f"{conv_2_name}/output_0"], outputs=[gelu_2_output], name=gelu_2_name, approximate="none") | ||
| self.make_value(gelu_2_output, dtype=self.io_dtype, shape=["batch_size", self.hidden_size, self.max_source_positions]) | ||
| self.make_node("Gelu", inputs=[f"{conv_2_name}/output_0"], outputs=[gelu_2_output], name=gelu_2_name) |
```python
from .builders.gemma import Gemma3Model

onnx_model = Gemma3Model(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
onnx_model.model_type = "gemma3_text"
```
This should be reverted. The model type for the VLM is gemma3, not gemma3_text, because the model type is derived from the architecture name in the config. The LLM uses gemma3_text since it is text-only. This VLM has image and text support so it uses gemma3 as the model type instead (see here for reference).
Fixed in 0b169f8. Changed onnx_model.model_type = "gemma3_text" to onnx_model.model_type = "gemma3" for the Gemma3ForConditionalGeneration (VLM) case.
```python
fp32_webgpu = execution_provider == "webgpu" and extra_options.get("use_webgpu_fp32", False)
bf16_cuda = precision == "int4" and execution_provider in {"cuda", "trt-rtx"} and extra_options.get("use_cuda_bf16", False)
bf16_cuda = (
    precision == "int4" and execution_provider in {"cuda", "trt-rtx"} and extra_options.get("use_cuda_bf16", False)
```

```python
class PvVersion:
    """Simple version of packaging.version.Version."""
```
Why can't we import packaging and use it directly?
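If packaging is acceptable as a dependency, `packaging.version.Version` handles this directly; otherwise, a stdlib-only sketch that covers plain numeric versions (pre-release tags would need packaging's full grammar):

```python
def parse_version(version: str) -> tuple:
    """Tuple-compare-friendly parse of a plain numeric version string.

    Stops at the first non-numeric component, so "5.0.0.dev0" compares as
    (5, 0, 0); this is cruder than packaging.version but dependency-free.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)
```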
```python
shutil.rmtree(path)

def get_dirs(self, prefix: str, clean: bool = True) -> tuple[str]:
    output_dir = f"dump_models/{prefix}/output"
```
Can we use path utils to create the path rather than hardcoding a string (e.g. os.path.join or Path(...) instead) for cross-platform compatibility?
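A portable version of that path construction, as the comment suggests (sketch):

```python
import os

def get_output_dir(prefix: str) -> str:
    # os.path.join uses the platform separator instead of a hard-coded "/".
    return os.path.join("dump_models", prefix, "output")
```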
```python
return lambda x: x

class ExtTestCase(unittest.TestCase):
```
What is ext an abbreviation for? Can we use the full name and rename the file accordingly?
```python
if not full.startswith(prefix):
    raise AssertionError(f"prefix={prefix!r} does not start string {full!r}.")

def check_ort(self, onx: Union["onnx.ModelProto", str]) -> "onnxruntime.InferenceSession":  # noqa: F821
```
Why test with ORT directly? Rather than test models with ORT, let's rewrite this file to test with ORT GenAI directly. That will test ORT as well.
```python
_MODEL_ID = "arnir0/Tiny-LLM"

class _AttentionOnlyLlamaModel(LlamaModel):
```
Should we create a 1-layer ONNX model and a 1-layer torch model to compare instead of attention-only? Then, we know each layer is being created correctly. In that way, we can also use ORT GenAI for the comparison instead of direct ORT.
```python
def _make_tiny_lm_no_cache(fixed_token: int = 3) -> onnx.ModelProto:
    """Returns a minimal ONNX LM with no KV cache.
```
Given we don't support models without KV cache inputs/outputs and given we can pass in a 0-filled KV cache to essentially not use a KV cache, I'm not sure how much gain we get from this test. Can we remove this?
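The comment's alternative, zero-length past KV feeds, could be sketched like this (the input names follow the usual `past_key_values.{i}.key/value` convention, an assumption here):

```python
import numpy as np

def empty_kv_cache(num_layers: int, batch: int, num_kv_heads: int, head_size: int, dtype=np.float32) -> dict:
    """Build past-KV inputs with sequence length 0 so a KV-cache model can run
    a pure prefill step, standing in for a dedicated no-cache model."""
    feeds = {}
    for i in range(num_layers):
        shape = (batch, num_kv_heads, 0, head_size)
        feeds[f"past_key_values.{i}.key"] = np.zeros(shape, dtype=dtype)
        feeds[f"past_key_values.{i}.value"] = np.zeros(shape, dtype=dtype)
    return feeds
```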
```python
# Minimal ChatGLM model source code saved to the model directory so that
# AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True) can load
# the model completely offline during fast tests.
_MODELING_CODE = """\
```
Can we instead pre-install a version of transformers inside our VM images so we don't have to save the source code? Then, we read the modeling file from the local installation. It will be tedious to have to update this modeling code each time there's a change on Hugging Face's interfaces with newer transformers versions.
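Reading the modeling file from the locally installed package, as the comment suggests, might look like this (the module-lookup helper is a sketch; actual transformers module paths vary by version):

```python
import importlib.util

def local_modeling_source(module_name: str) -> str:
    """Return the source path of an installed module, so tests can copy the
    pinned transformers modeling code instead of vendoring it as a string."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(module_name)
    return spec.origin
```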
```python
return model, config

class TestChatGLM(ExtTestCase):
```
There are a lot of modeling files for testing so I won't write the same comment on each one. But we should use ORT GenAI for testing instead of direct ORT since ORT GenAI will test both.
```python
import numpy as np
from ext_test_case import ExtTestCase, hide_stdout, requires_cuda, run_session_or_io_binding
```
All of these LLMs have the same input/output structure when using ORT GenAI. Can we consolidate all of these files into a generic test_random_llm.py file instead? That would greatly reduce the amount of code to review, test, update, etc.
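A skeleton of what a consolidated, parameterized module could look like (the case registry and tiny configs below are placeholders, not real model ids; the comparison body is elided):

```python
import unittest

# Hypothetical consolidation sketch: one parameterized module instead of one
# test file per architecture.
MODEL_CASES = [
    ("olmo3", {"num_hidden_layers": 1, "hidden_size": 16}),
    ("smollm3", {"num_hidden_layers": 1, "hidden_size": 16}),
    ("qwen3", {"num_hidden_layers": 1, "hidden_size": 16}),
]

class TestRandomLLM(unittest.TestCase):
    def test_prefill_decode_discrepancies(self):
        for name, tiny_config in MODEL_CASES:
            with self.subTest(model=name):
                # Build a tiny random-weight model, export it, then compare
                # ORT GenAI against PyTorch here (elided in this sketch).
                self.assertGreater(tiny_config["hidden_size"], 0)
```

`subTest` keeps per-architecture failures separately reported while sharing one code path for all models with the same I/O structure.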
```python
def get_path_for_model(data_path, model_name, precision, device):
    if data_path is None:
```
Can we instead move the generated torch models to be stored inside the location in test_models? The CIs are heavily dependent on this flag existing.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
… FP16 CPU GQA, add PR comment step Agent-Logs-Url: https://github.com/microsoft/onnxruntime-genai/sessions/b79ebc36-b773-4f75-9f29-9f8dd5d65330 Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
…kflow Agent-Logs-Url: https://github.com/microsoft/onnxruntime-genai/sessions/b79ebc36-b773-4f75-9f29-9f8dd5d65330 Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>