extend modelbuilder to build Olmo3, SmolLM3 and other models#2078

Open
xadupre wants to merge 18 commits into main from xadupre/mbext

Conversation

xadupre (Member) commented Apr 10, 2026

  • it adds unit tests to check discrepancies on all supported architectures
  • it fixes a few issues in existing architectures detected through those unit tests
  • it supports transformers>=5 by modifying the final genai config to make it look like a transformers<5 one
  • it adds a job to run the created models on CPU with the released version of onnxruntime-genai
  • the tests run in 3 or 4 minutes on two versions of transformers (4.57, 5.5)
  • it uploads an artifact showing the test results (example below)
model_id experiment precision provider input_type max_abs_err avg_abs_discrepancy dnan next_token next_token_id_tch next_token_id_ort test first_diff total_diff kind step %>0.1 %>0.01
0 arnir0/Tiny-LLM attention_decode fp32 cpu text 1.3411e-07 3.00374e-08 0 OK 324 324 test_llama_attention_decode_discrepancies 0 0
1 arnir0/Tiny-LLM attention_prefill fp32 cpu text 3.27826e-07 4.81524e-08 0 OK 155 155 test_llama_attention_prefill_discrepancies 0 0
2 THUDM/chatglm3-6b generate fp16 cpu text test_generation_chatglm_fp16_cpu 0 fast
3 THUDM/chatglm3-6b generate fp32 cpu text test_generation_chatglm_fp32_cpu 0 fast
4 THUDM/chatglm3-6b forward fp16 cpu text 0.000906795 0.000173806 0 OK 905 905 test_discrepancies_chatglm_fp16_cpu fast prefill 0 0
5 THUDM/chatglm3-6b forward fp16 cpu text 0.000891805 0.00018833 0 OK 1322 1322 test_discrepancies_chatglm_fp16_cpu fast decode 0 0
6 THUDM/chatglm3-6b forward fp32 cpu text 8.34465e-07 1.66793e-07 0 OK 905 905 test_discrepancies_chatglm_fp32_cpu fast prefill 0 0
7 THUDM/chatglm3-6b forward fp32 cpu text 7.7486e-07 1.58349e-07 0 OK 1322 1322 test_discrepancies_chatglm_fp32_cpu fast decode 0 0

Copilot AI review requested due to automatic review settings April 10, 2026 10:28
Comment thread src/python/py/models/builders/qwen.py Dismissed
Comment thread test/python/models/fast/ext_test_case.py Fixed
Comment thread test/python/models/fast/test_random_phimoe.py Dismissed
xadupre and others added 2 commits April 10, 2026 12:38
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR expands the Python modelbuilder to cover additional HF architectures (e.g., OLMo3, SmolLM3, Qwen3/Qwen3.5, Nemotron-H, Gemma3) and adds a comprehensive “fast” test suite that builds tiny random-weight models offline and validates ONNX vs PyTorch (and/or validates export output). It also adds a compatibility shim to post-process genai_config.json for transformers>=5 generation defaults.

Changes:

  • Add many new offline “fast” unit tests that export minimal random-weight models and compare ORT outputs against PyTorch (prefill/decode and some greedy generation).
  • Extend/fix multiple builder implementations (OLMo*, SmolLM, Nemotron/Nemotron-H, Gemma3, Ernie, GPT-OSS, etc.) to match architecture quirks and transformers v4/v5 config differences.
  • Add fix_genai_config utility + tests, relax pytest --test_models option requirement, and introduce a GitHub Actions workflow for fast tests.
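The transformers>=5 shim described above can be pictured with a minimal sketch. This is illustrative only: the real fix_genai_config lives in src/python/py/models/genai_config_utils.py, and the exact set of defaults (GENAI_SEARCH_DEFAULTS) may differ from what is shown here.

```python
# Sketch of the compatibility shim: transformers>=5 serializes unset
# generation defaults as null in genai_config.json, so backfill the
# transformers<5 defaults. Names and default values are illustrative.
GENAI_SEARCH_DEFAULTS = {
    "do_sample": False,
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
}

def fix_genai_config(genai_config: dict) -> dict:
    """Fill null search defaults in-place and return the config."""
    search = genai_config.setdefault("search", {})
    for key, default in GENAI_SEARCH_DEFAULTS.items():
        if search.get(key) is None:
            search[key] = default
    return genai_config
```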

Reviewed changes

Copilot reviewed 52 out of 53 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/python/models/fast/test_random_whisper.py Adds offline Whisper encoder/decoder export + ORT vs PyTorch discrepancy checks.
test/python/models/fast/test_random_smollm3.py Adds SmolLM3 random-weight discrepancy tests and greedy generation checks.
test/python/models/fast/test_random_qwen3_vl.py Adds Qwen3-VL text-path discrepancy tests (transformers>=5 guarded).
test/python/models/fast/test_random_qwen3_5.py Adds Qwen3.5 full-attention execution test + hybrid build validation (transformers>=5).
test/python/models/fast/test_random_qwen3_0_6b.py Adds Qwen3 random-weight discrepancy and greedy generation tests.
test/python/models/fast/test_random_qwen2_5_vl.py Adds Qwen2.5-VL text-path discrepancy tests (transformers>=5 guarded).
test/python/models/fast/test_random_phi4mm.py Adds Phi4MM synthetic-weight build + ORT vs PyTorch comparisons (uses PEFT LoRA).
test/python/models/fast/test_random_phi3.py Adds Phi-3 random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_random_phi.py Adds Phi-2 random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_random_olmo3.py Adds OLMo3 random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_random_olmo2.py Adds OLMo2 random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_random_olmo.py Adds OLMo (v1) random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_random_nemotron.py Adds Nemotron random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_random_nemotron_h.py Adds Nemotron-H random-weight discrepancy + greedy generation tests (transformers>=5 guarded).
test/python/models/fast/test_random_granite.py Adds Granite random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_random_gemma3.py Adds Gemma3 random-weight discrepancy + greedy generation tests (transformers>=5 guarded).
test/python/models/fast/test_random_gemma.py Adds Gemma random-weight discrepancy + greedy generation tests.
test/python/models/fast/test_onnx_generate.py Adds unit tests for ONNX greedy generation helper (no-cache and KV-cache paths).
test/python/models/fast/test_llama_attention_discrepancies.py Adds attention-only Llama export + numeric discrepancy tests.
test/python/models/fast/test_fix_genai_config.py Adds tests validating fix_genai_config behavior for null generation defaults.
test/python/models/fast/test_check_extra_options.py Adds tests for extra_options parsing/validation behavior.
test/python/conftest.py Makes --test_models optional for pytest runs.
src/python/py/models/genai_config_utils.py Introduces fix_genai_config to fill null search defaults for transformers>=5.
src/python/py/models/builders/smollm.py Fixes per-layer RoPE enable/disable handling for SmolLM.
src/python/py/models/builders/phi.py Minor formatting tweak around position_ids reformatting naming.
src/python/py/models/builders/olmo.py Adds OLMo2/OLMo3 builders + OLMo v1 LayerNorm handling + OLMo2/3 q/k norm + post-norm residual flow.
src/python/py/models/builders/nemotron.py Adds Nemotron-H builder and fixes Nemotron epsilon handling.
src/python/py/models/builders/internlm.py Formatting/quoting cleanup (no apparent logic change in diff shown).
src/python/py/models/builders/gptoss.py Fixes RoPE cache inv_freq math and tightens some style/robustness in bias combining.
src/python/py/models/builders/gemma.py Improves Gemma3 RoPE parameter handling (v4/v5) and adds conditional-generation load_weights path.
src/python/py/models/builders/ernie.py Patches rope_theta extraction from rope_parameters for Ernie4.5 configs.
src/python/py/models/builders/__init__.py Re-exports fix_genai_config and narrows __all__.
.gitignore Ignores dump_models/ and stats/ produced by fast tests.
.github/workflows/modelbuilder_fast_tests.yml Adds a GitHub Actions workflow to run fast modelbuilder tests across transformers versions.

Comment thread test/python/models/fast/test_random_phi4mm.py
Comment thread test/python/models/fast/test_llama_attention_discrepancies.py Outdated
Comment thread test/python/conftest.py
Comment thread .github/workflows/modelbuilder_fast_tests.yml Outdated
Comment thread test/python/models/fast/test_llama_attention_discrepancies.py Fixed
…n overriding method'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Comment thread .github/workflows/modelbuilder_fast_tests.yml Outdated
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 53 out of 54 changed files in this pull request and generated 15 comments.

Comment on lines +48 to +49
model = AutoModelForCausalLM.from_pretrained(model_id, ignore_mismatched_sizes=True, dtype=dtype)
model.eval().to(provider).to(dtype)
Copilot AI Apr 10, 2026

AutoModelForCausalLM.from_pretrained(...) does not accept a dtype= kwarg in the Transformers API used elsewhere in this repo (it uses torch_dtype=). As written this will raise a TypeError and the long test can't run.

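Assuming the rename the comment describes (transformers>=5 accepts dtype=, older releases expect torch_dtype=), a small version-keyed helper could make the call version-agnostic. This helper is hypothetical and not part of the PR.

```python
def dtype_kwargs(transformers_version: str, dtype) -> dict:
    # Pick the dtype keyword for from_pretrained based on the installed
    # transformers major version (assumption: dtype= is the v5 spelling,
    # torch_dtype= the v4 one).
    major = int(transformers_version.split(".")[0])
    key = "dtype" if major >= 5 else "torch_dtype"
    return {key: dtype}

# usage (sketch):
# AutoModelForCausalLM.from_pretrained(model_id, **dtype_kwargs(transformers.__version__, torch.float32))
```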
Comment on lines +79 to +84
precision=precision,
model_id=model_id,
experiment="forward",
provider="cuda",
test=f"test_trained_{smodel_id}_discrepancies_{precision}_{provider}",
input_type="text",
Copilot AI Apr 10, 2026

The discrepancy metadata hard-codes provider="cuda" even when the test is executed with provider="cpu", which will mislabel results and make debugging harder. Use the provider parameter value instead.

Comment on lines +119 to +133
prompt_len = torch_feed["input_ids"].shape[1]
with torch.no_grad():
pt_output = model.generate(**torch_feed, max_new_tokens=max_new_tokens, do_sample=False, pad_token_id=tokenizer.eos_token_id)
# Keep only the newly generated tokens (exclude the prompt).
pt_tokens = pt_output[0][prompt_len:].tolist()

# ------------------------------------------------------------------
# onnxruntime-genai greedy generation
# ------------------------------------------------------------------
og_model = og.Model(os.path.dirname(onnx_path))

params = og.GeneratorParams(og_model)
params.set_search_options(do_sample=False, max_length=max_new_tokens, temperature=1.0, top_k=1)

generator = og.Generator(og_model, params)
Copilot AI Apr 10, 2026

onnxruntime-genai's max_length is the total sequence length (prompt + generated). This sets max_length=max_new_tokens, which will stop generation early for non-empty prompts and can cause token mismatches vs transformers.generate(max_new_tokens=...). Set max_length=prompt_len + max_new_tokens (or use a max_new_tokens option if available).

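The distinction the comment draws can be captured in a one-line helper (names are illustrative; they mirror the review comment, not an API):

```python
def og_max_length(prompt_len: int, max_new_tokens: int) -> int:
    # onnxruntime-genai's max_length bounds the *total* sequence
    # (prompt + generated tokens), unlike transformers' max_new_tokens,
    # which counts only newly generated tokens.
    return prompt_len + max_new_tokens

# sketch, per the suggested fix:
# params.set_search_options(do_sample=False, max_length=og_max_length(prompt_len, max_new_tokens))
```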
Comment on lines +6 to +8
"""
This test file is ignore unless you set ``LONGTEST=1``.
It verifies a pretrained text model, discrepancies at prefill and decoding steps.
Copilot AI Apr 10, 2026

Spelling/grammar: "This test file is ignore" → "This test file is ignored".

Comment on lines +76 to +79
onnx_path = os.path.join(output_dir, "model.onnx")
self.assertExists(onnx_path)
sess = self.check_ort(onnx_path)

Copilot AI Apr 10, 2026

This test builds the model with execution_provider=provider (including CUDA), but uses self.check_ort(...) which forces a CPU-only InferenceSession. For CUDA subtests this won't exercise the CUDA EP (and can hide CUDA-only issues). Use _check_with_ort(onnx_path, cpu=provider == "cpu") instead.

Comment on lines +78 to +81
onnx_path = os.path.join(output_dir, "model.onnx")
self.assertExists(onnx_path)
sess = self.check_ort(onnx_path)

Copilot AI Apr 10, 2026

This test builds the model with execution_provider=provider (including CUDA), but uses self.check_ort(...) which forces a CPU-only InferenceSession. For CUDA subtests this won't exercise the CUDA EP. Use _check_with_ort(onnx_path, cpu=provider == "cpu") instead.

Comment thread test/python/models/fast/test_random_granite.py Outdated
Comment thread test/python/models/fast/test_random_gemma.py Outdated
Comment thread test/python/models/fast/test_random_gemma3.py Outdated
Comment thread .github/workflows/modelbuilder_fast_tests.yml Outdated
- name: Install nightly pytorch (cpu)
if: matrix.torch-version == 'nightly'
run: pip install --pre --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

Contributor

Can we use the requirements.txt files in test/python per device target instead of specifying packages here?

Member Author

We can. I installed the CPU build of torch first to avoid pip trying the GPU version, which takes a while to install.


- name: Upload stats artifact
if: always()
uses: actions/upload-artifact@v4
Contributor

Is there a way we can have a bot that adds these results as a comment on the PRs as well? Then, we don't have to download each zip file on every run.

Member Author

@copilot can you suggest something?

Contributor

Added in ccee173. After the artifact upload, a new "Post test results as PR comment" step uses actions/github-script to read all CSV files from stats/, convert them to collapsible markdown tables, and create (or update) a PR comment. Each matrix combination gets a unique HTML marker so re-runs update the existing comment instead of creating duplicates. The job also needs pull-requests: write permission, which was added.
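The CSV-to-markdown conversion that step performs can be sketched in a few lines. The workflow actually does this via actions/github-script in JavaScript; this Python version only illustrates the transformation, and the function name is made up.

```python
import csv
import io

def csv_to_markdown(text: str) -> str:
    # Convert one CSV stats file into a markdown table, the same shape of
    # output the PR-comment step produces before wrapping it in <details>.
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```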


onnx_model = GPTOSSModel(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
elif config.architectures[0] == "GraniteForCausalLM":
from .builders.granite import GraniteModel
Contributor

What is the motivation for moving the imports here?

Member Author

To support multiple versions of transformers (4.57, 5.5). Some models are not available in 4.57, so each branch imports what it needs when it needs it.

Contributor

The local imports inside each elif block were already present in the original code — this PR didn't move them. They avoid loading all model builder modules at import time; only the relevant one is loaded when a specific architecture is requested.

Comment thread src/python/py/models/builders/base.py Outdated
"input_ids": ir.DataType.INT64, # For standard models
"attention_mask": ir.DataType.INT64, # For standard models
"position_ids": ir.DataType.INT64, # For standard models
"inputs_embeds": self.io_dtype, # For standard models where you want to remove the embedding layer from the model (note that `inputs_embeds` is written this way to match Hugging Face format)
Contributor

Can we revert the spacing changes in this file? These spacing changes were there before from an automatic linter, and many of the spacing changes make it harder to read/compare lines.

Member Author

I should be able to do that. But it would be great to have some rules I could use to normalize the style. Why not merge a PR changing the style and only the style, then merge this one?

Contributor

Restored in 0b169f8. All the attribute dicts in __init__ (input_types, input_shapes, output_names, output_types, output_shapes, mask_attrs, embed_attrs, layernorm_attrs, rope_attrs, attention_attrs, mlp_attrs, moe_attrs, lm_head_attrs) and other long-line single-expressions (window_size, intermediate_size, num_kv_heads, activation, graph, attn_softcap, kernel_block_size, lm_head_softcap) were reverted to the original column-aligned style.

:param genai_config: genai_config dict (modified in-place and returned).
:return: the modified *genai_config* dict.
"""
search = genai_config.get("search", {})
Contributor

The following try-except block is attempting to identify default values.

```python
try:
    # Override search attributes in config based on values in generation_config.json
    gen_config = GenerationConfig.from_pretrained(
        model_name_or_path, token=self.hf_token, trust_remote_code=self.hf_remote, **extra_kwargs
    )
    defaults = {
        "bos_token_id": None,
        "do_sample": False,
        "eos_token_id": None,
        "pad_token_id": None,
        "temperature": 1.0,
        "top_k": 50,
        "top_p": 1.0,
    }
    for key, default_val in defaults.items():
        val = getattr(gen_config, key)
        if val is not None and val != default_val:
            setattr(config, key, getattr(gen_config, key))
except:
    pass
```

Can we modify this inside make_genai_config instead of making a new fix_genai_config method? We can initialize GENAI_SEARCH_DEFAULTS inside make_genai_config.

# HuggingFace YaRN formula:
# final_mscale = yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
# where yarn_get_mscale(s, m) = 0.1 * m * log(s) + 1 if s > 1 else 1
def _yarn_get_mscale(s, m):
Contributor

Can we modify the existing make_mscale methods or add a new make_* method to perform the new computations rather than nesting this here?
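For reference, the HuggingFace YaRN formula quoted in the snippet transcribes directly to:

```python
import math

def yarn_get_mscale(scale: float, mscale: float = 1.0) -> float:
    # yarn_get_mscale(s, m) = 0.1 * m * log(s) + 1 if s > 1 else 1
    if scale <= 1:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0

def final_mscale(factor: float, mscale: float, mscale_all_dim: float) -> float:
    # final_mscale = yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
    return yarn_get_mscale(factor, mscale) / yarn_get_mscale(factor, mscale_all_dim)
```

Whether this lives in make_mscale or a new make_* method, the computation itself is the one above.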

Comment thread src/python/py/models/builders/base.py Outdated
def is_gqa_supported(self) -> bool:
valid_gqa_configurations = {
("cpu", ir.DataType.FLOAT),
("cpu", ir.DataType.FLOAT16),
Contributor

Many FP16 CPU kernels are not implemented. This will cause a performance regression for users. Can we remove this?

Member Author

We can.

Contributor

Removed in 0b169f8. ("cpu", ir.DataType.FLOAT16) was removed from is_gqa_supported() with an explanatory comment. Many FP16 CPU GQA kernels are unimplemented so including this configuration would cause a performance regression.

# OLMo v1 uses standard LayerNorm (with mean subtraction), not RMSNorm.
self.layernorm_attrs["simple"] = False
# OlmoLayerNorm hard-codes eps=1e-5 and OlmoConfig has no rms_norm_eps/layer_norm_eps
# attribute, so override the epsilon that the base class defaulted to 1e-6.
Contributor

The default here does not match the comment. Can we move this parsing into the base class so it is done automatically?


def make_layernorm(self, layer_id, layernorm, skip, simple, location):
# OlmoLayerNorm has no learnable weight or bias; inject identity values so
# the base class make_layernorm_op can create the initializer tensors it needs.
Contributor

If there are no weights or biases to set, do we even need to create the LayerNorm ops? Could we override make_layer to not create such ops?


# For non-quantized models, load the Hugging Face model
print("Loading Qwen2_5_VLForConditionalGeneration model...")
from transformers import Qwen2_5_VLForConditionalGeneration
Contributor

Why remove the global imports and use local imports?

Member Author

To handle multiple versions of transformers.

Contributor

These local imports inside load_weights (e.g. from transformers import Qwen2_5_VLForConditionalGeneration) were already present in the original code — this PR didn't move them. They're kept local to avoid importing heavy VLM classes at module load time unless that specific model class is actually used.

# GQA is not supported in Whisper since there is no attention mask input
return False

def is_packed_attn_supported(self):
Contributor

Packed attention is supported in the encoder. You can download the ONNX models from the Foundry Local catalog to see what they look like.

"hidden_states": ["batch_size", self.max_source_positions, self.hidden_size], # ['batch_size', 'num_frames / 2', 'hidden_size']
"present_key_cross": ["batch_size", self.num_attn_heads, self.max_source_positions, self.head_size], # ['batch_size', 'num_heads', 'num_frames / 2', 'head_size']
"present_value_cross": ["batch_size", self.num_attn_heads, self.max_source_positions, self.head_size] # ['batch_size', 'num_heads', 'num_frames / 2', 'head_size']
"hidden_states": [
Contributor

Same comment as here

gelu_1_name = f"{basename}/Gelu_1"
gelu_1_output = f"{gelu_1_name}/output_0"
self.make_node("Gelu", inputs=[f"{conv_1_name}/output_0"], outputs=[gelu_1_output], name=gelu_1_name, approximate="none")
self.make_node("Gelu", inputs=[f"{conv_1_name}/output_0"], outputs=[gelu_1_output], name=gelu_1_name)
Contributor

Same comment as here

gelu_2_output = f"{gelu_2_name}/output_0"
self.make_node("Gelu", inputs=[f"{conv_2_name}/output_0"], outputs=[gelu_2_output], name=gelu_2_name, approximate="none")
self.make_value(gelu_2_output, dtype=self.io_dtype, shape=["batch_size", self.hidden_size, self.max_source_positions])
self.make_node("Gelu", inputs=[f"{conv_2_name}/output_0"], outputs=[gelu_2_output], name=gelu_2_name)
Contributor

Same comment as here

Comment thread src/python/py/models/builder.py Outdated
from .builders.gemma import Gemma3Model

onnx_model = Gemma3Model(config, io_dtype, onnx_dtype, execution_provider, cache_dir, extra_options)
onnx_model.model_type = "gemma3_text"
Contributor

This should be reverted. The model type for the VLM is gemma3, not gemma3_text, because the model type is derived from the architecture name in the config. The LLM uses gemma3_text since it is text-only. This VLM has image and text support so it uses gemma3 as the model type instead (see here for reference).

Contributor

Fixed in 0b169f8. Changed onnx_model.model_type = "gemma3_text" to onnx_model.model_type = "gemma3" for the Gemma3ForConditionalGeneration (VLM) case.

Comment thread src/python/py/models/builder.py Outdated
fp32_webgpu = execution_provider == "webgpu" and extra_options.get("use_webgpu_fp32", False)
bf16_cuda = precision == "int4" and execution_provider in {"cuda", "trt-rtx"} and extra_options.get("use_cuda_bf16", False)
bf16_cuda = (
precision == "int4" and execution_provider in {"cuda", "trt-rtx"} and extra_options.get("use_cuda_bf16", False)
Contributor

Same comment as here



class PvVersion:
"""Simple version of packaging.version.Version."""
Contributor

Why can't we import packaging and use it directly?
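For context, a tuple-based comparison along these lines is presumably what PvVersion implements (this is a hypothetical stand-in; packaging.version.Version additionally handles pre-releases, epochs, and local versions correctly):

```python
def parse_version(v: str) -> tuple:
    # Keep the leading digits of each dot-separated component; suffixes
    # like "rc1" are dropped, which is exactly the kind of corner case
    # packaging.version.Version exists to handle properly.
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)
```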

shutil.rmtree(path)

def get_dirs(self, prefix: str, clean: bool = True) -> tuple[str]:
output_dir = f"dump_models/{prefix}/output"
Contributor

Can we use path utils to create the path rather than hardcoding a string (e.g. os.path.join or Path(...) instead) for cross-platform compatibility?
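The suggested change amounts to something like the following (function name shortened for illustration):

```python
import os

def get_output_dir(prefix: str) -> str:
    # Cross-platform path construction instead of the hardcoded
    # f"dump_models/{prefix}/output" string.
    return os.path.join("dump_models", prefix, "output")
```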

return lambda x: x


class ExtTestCase(unittest.TestCase):
Contributor

What is ext an abbreviation for? Can we use the full name and rename the file accordingly?

if not full.startswith(prefix):
raise AssertionError(f"prefix={prefix!r} does not start string {full!r}.")

def check_ort(self, onx: Union["onnx.ModelProto", str]) -> "onnxruntime.InferenceSession":  # noqa: F821
Contributor

Why test with ORT directly? Rather than test models with ORT, let's rewrite this file to test with ORT GenAI directly. That will test ORT as well.

_MODEL_ID = "arnir0/Tiny-LLM"


class _AttentionOnlyLlamaModel(LlamaModel):
Contributor

Should we create a 1-layer ONNX model and a 1-layer torch model to compare instead of attention-only? Then, we know each layer is being created correctly. In that way, we can also use ORT GenAI for the comparison instead of direct ORT.



def _make_tiny_lm_no_cache(fixed_token: int = 3) -> onnx.ModelProto:
"""Returns a minimal ONNX LM with no KV cache.
Contributor

Given we don't support models without KV cache inputs/outputs and given we can pass in a 0-filled KV cache to essentially not use a KV cache, I'm not sure how much gain we get from this test. Can we remove this?
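The zero-filled-cache alternative the comment mentions could look like this. The past_key_values.{i}.key/value input names are an assumption about the model's KV-cache input naming; adjust them to the actual graph.

```python
import numpy as np

def empty_past_kv(num_layers: int, num_kv_heads: int, head_size: int, batch: int = 1) -> dict:
    # A sequence-length-0 past KV cache: feeding it to a KV-cache model
    # reproduces the no-cache prefill path without a separate cache-less graph.
    feeds = {}
    for i in range(num_layers):
        shape = (batch, num_kv_heads, 0, head_size)
        feeds[f"past_key_values.{i}.key"] = np.zeros(shape, dtype=np.float32)
        feeds[f"past_key_values.{i}.value"] = np.zeros(shape, dtype=np.float32)
    return feeds
```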

# Minimal ChatGLM model source code saved to the model directory so that
# AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True) can load
# the model completely offline during fast tests.
_MODELING_CODE = """\
Contributor

Can we instead pre-install a version of transformers inside our VM images so we don't have to save the source code? Then, we read the modeling file from the local installation. It will be tedious to have to update this modeling code each time there's a change on Hugging Face's interfaces with newer transformers versions.

return model, config


class TestChatGLM(ExtTestCase):
Contributor

There are a lot of modeling files for testing so I won't write the same comment on each one. But we should use ORT GenAI for testing instead of direct ORT since ORT GenAI will test both.


import numpy as np
from ext_test_case import ExtTestCase, hide_stdout, requires_cuda, run_session_or_io_binding

Contributor

All of these LLMs have the same input/output structure when using ORT GenAI. Can we consolidate all of these files into a generic test_random_llm.py file instead? That would greatly reduce the amount of code to review, test, update, etc.

Comment thread test/python/conftest.py


def get_path_for_model(data_path, model_name, precision, device):
if data_path is None:
Contributor

Can we instead move the generated torch models to be stored inside the location in test_models? The CIs are heavily dependent on this flag existing.

xadupre and others added 4 commits April 14, 2026 19:20
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
xadupre and others added 3 commits April 14, 2026 19:26
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
… FP16 CPU GQA, add PR comment step

Agent-Logs-Url: https://github.com/microsoft/onnxruntime-genai/sessions/b79ebc36-b773-4f75-9f29-9f8dd5d65330

Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>