
Support transformers v5 #481

Merged
jlamypoirier merged 20 commits into main from jlp_transformers_v5
Apr 24, 2026

Conversation

Collaborator

@jlamypoirier jlamypoirier commented Apr 9, 2026

Summary

  • Support transformers v5 alongside v4 in Fast-LLM core and external models (_TRANSFORMERS_V4 flag)
  • Fix MTP Llama and Apriel2 external model compatibility with v5 API changes
  • Fix LLaVA plan.py: checkpoint keys are identical in v4 and v5 (save_pretrained always uses v4 naming); remove incorrect version-conditional key prefix
  • Fix TestGDNEquivalence: branch Qwen3NextDynamicCache vs DynamicCache, cache_position kwarg, and recurrent state access path between v4 and v5
  • Fix converter bugs found by new HF roundtrip tests: Mixtral mlp_bias injection, Qwen2 head_dim export, Apriel2 config omissions
  • Add HF → Fast-LLM → HF roundtrip tests for llama, mistral, qwen2, mixtral, mtp_llama, apriel2_attention, apriel2_supernet
  • Pin transformers to v5 (>=5.0.0,<6.0.0) in Dockerfile
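The `_TRANSFORMERS_V4` flag mentioned above gates all v4 compatibility code. As a minimal sketch of what such a flag could look like (the repo's actual detection mechanism is not shown in this PR; the function name here is illustrative):

```python
def is_transformers_v4(version: str) -> bool:
    # True for any transformers release below 5.0.0, e.g. "4.57.3"
    return int(version.split(".")[0]) < 5
```

In the codebase this would typically be evaluated once against `transformers.__version__` and imported everywhere else, so dropping v4 support later is a single grep-and-delete.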

jlamypoirier and others added 12 commits April 22, 2026 20:34
- Widen transformers version constraint to >=4.57.3,<6.0.0
- Version-gate PretrainedConfig init (__init__ vs __post_init__) and dtype attribute (torch_dtype vs dtype) using dataclasses.is_dataclass detection
- Fall back to transformers.modeling_utils.no_init_weights for 4.x
- Support both rope_parameters (5.x) and rope_theta/rope_scaling (4.x) in Llama import/export config
- Handle both attribute paths for vision_tower in multimodal HF model test
- Fix mtp_llama LlamaRotaryEmbedding to handle both rope config formats
- Add _gdn_fla_available and _kda_fla_available flags to apriel2; use them to properly skip backup SSM tests when fla kernels are absent
- Update CLAUDE.md with redirect-to-file and external model test guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…compatibility

- apriel2/modeling_apriel2.py: add _TRANSFORMERS_V5 flag; fix _tied_weights_keys
  to dict format for 5.x (list for 4.x); add rope_parameters to PixtralRotaryEmbedding
  SimpleNamespace config
- mtp_llama/modeling_mtp_llama.py: add _TRANSFORMERS_V5 flag; fix _tied_weights_keys
- apriel2/conversion/llava/config.py: handle 5.x rope_parameters dict in text and
  vision configs alongside 4.x rope_theta
- apriel2/conversion/llava/plan.py: version-conditional source weight key prefixes
  (5.x LlavaForConditionalGeneration adds model. prefix to submodules)
- test_cache_contracts.py: update DynamicLayer.get_mask_sizes calls to pass int in 5.x
  (query_length) vs tensor in 4.x; update sdpa_mask signature for 5.x (q_length/q_offset)
- test_convert_from_llava.py: use version-conditional embed_tokens source key
- test_equivalence.py: fix get_image_features handling — 5.x returns BaseModelOutput
  with projected features in pooler_output (not last_hidden_state)
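The `_tied_weights_keys` format change described above (list in 4.x, dict in 5.x) can be sketched like this; the specific key names are illustrative, standing in for the repo's lm-head/embedding tie:

```python
def tied_weights_keys(transformers_v4: bool):
    if transformers_v4:
        return ["lm_head.weight"]                            # 4.x: flat list of tied keys
    return {"lm_head.weight": "model.embed_tokens.weight"}   # 5.x: target -> source mapping
```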

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix num_blocks off-by-one in import_config (was subtracting 1)
- Fix num_hidden_layers off-by-one in export_config (was adding 1)
- Fix mtp_heads index off-by-one in get_converters (was prediction_distance - 1)
- Fix hidden state collection order in MTPLlamaModel: add embedding before
  trunk loop and add trunk layer outputs inside the loop, consistent with
  standard transformers @capture_outputs behavior
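The corrected collection order (embedding output first, then each trunk layer's output inside the loop) can be shown with a toy version; `embed` and `trunk` here are arbitrary callables, not the real model components:

```python
def collect_hidden_states(embed, trunk, x):
    h = embed(x)
    states = [h]              # embedding output recorded before the trunk loop
    for layer in trunk:
        h = layer(h)
        states.append(h)      # each trunk layer's output recorded inside the loop
    return h, states
```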

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update TOKENIZER_NAME from "bigcode/santacoder" to "gpt2" and update all
hardcoded token values in data tests to match the gpt2 vocabulary.
Also fix deprecated huggingface_hub.HfFolder.get_token() → huggingface_hub.get_token().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ERS_V4

- Deduplicate rope-type dispatch in LlamaAttentionConverter.import_config by
  normalizing rope_params/rope_theta from either checkpoint format first
- Rename _TRANSFORMERS_V5 → _TRANSFORMERS_V4 (inverted flag) so v4 compat
  code is in `if _TRANSFORMERS_V4:` blocks — grep-and-delete to drop v4
- Flip all if/else so v5 code is the default path and v4 is the guarded branch
- Import _TRANSFORMERS_V4 from config.py in huggingface.py; replace try/except
  with explicit if/else
- Add comments for v5 changes that can't use the flag (TYPE_CHECKING guard,
  checkpoint format detection, model.model structure)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use tuple prefixes unpacked into W(...) instead of the / operator,
keeping the _TRANSFORMERS_V4 branching for the path prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Keep llava_layer/apriel_layer intermediate variables (with / operator)
in loops; only the layer root W() calls use *prefix unpacking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- apriel2: override tie_weights() in Apriel2PreTrainedModel to recompute
  MistralRotaryEmbedding.inv_freq after v5 meta-device loading zeroes it
  (non-persistent buffers not in checkpoint are materialized as zeros)
- apriel2: hardcode _attn_implementation="eager" in preprocess mask config
  so an explicit float mask is always built (v5 sdpa returns None otherwise)
- apriel2: use version-conditional kwarg name for create_causal_mask
  (inputs_embeds in v5, input_embeds in v4)
- apriel2 test: compare only non-padding positions in test_logits_match
- mtp_llama: update LlamaRotaryEmbedding and config for v5 compatibility
- gpt/huggingface: remove dead code line
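The `inv_freq` buffer that `tie_weights()` recomputes is the standard RoPE inverse-frequency vector. A pure-Python sketch of that computation (the real code presumably builds a torch tensor):

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    # inverse frequencies of rotary embeddings: base^(-2i/d) for i in [0, d/2)
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
```

Recomputing it at tie time is needed because, as noted above, v5 meta-device loading materializes non-persistent buffers absent from the checkpoint as zeros.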

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This reverts commit 39196c6.
jlamypoirier and others added 7 commits April 22, 2026 23:57
- Replace removed Qwen3NextDynamicCache with DynamicCache; remove dropped
  cache_position kwarg from GDN forward; fix recurrent_states access path
- Fix rope_theta KeyError (moved to rope_parameters dict in v5)
- Fix attention_mask device mismatch in integration test
- Expand+contiguous attn mask before SDPA to satisfy CUDA kernel contiguity
- Use model._attn_implementation for causal mask creation so pure-causal
  inputs get None mask and SDPA uses is_causal=True (matching Qwen2 numerics)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… usage

supernet_path and fastllm_path are removed as soon as they've been consumed,
so only one checkpoint directory exists at a time instead of all three.
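The delete-as-you-go pattern described above can be sketched as a generic conversion chain; `steps` are arbitrary callables returning a new output directory, not the repo's actual converters:

```python
import shutil

def convert_chain(src_dir: str, steps) -> str:
    # run each conversion step, deleting the consumed input directory
    # immediately so at most one intermediate checkpoint exists at a time
    current = src_dir
    for step in steps:
        nxt = step(current)
        shutil.rmtree(current)
        current = nxt
    return current
```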

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/models/test_hf_roundtrip.py covering HF → Fast-LLM → HF
roundtrip for llama, mistral, qwen2, mixtral, mtp_llama, apriel2_attention,
and apriel2_supernet. Tests download only the HF config (no weights), build
a tiny model, convert both ways, and compare config and weights exactly.
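Exact config comparison of the kind these tests perform can be sketched with a small diff helper (a toy, not the tests' actual comparison code):

```python
def config_diff(a: dict, b: dict) -> dict:
    # keys whose values differ between the exported and re-imported configs;
    # an empty result means the roundtrip preserved the config exactly
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}
```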

Converter fixes found by the tests:
- mixtral: MixtralMLPConverter.import_config was missing `mlp_bias = False`
  injection (KeyError); export_config was leaking `mlp_bias` into the output
  dict (MixtralConfig has no such field).
- qwen2: export_config was writing `head_dim` (not a standard Qwen2Config
  field); added constraint check that heads * head_size == hidden_size on
  export; added assertion that use_mrope is not True on import.
- apriel2: omit window_size when None; omit add_linear_biases when True
  (default); omit sampling_strategy when uniform (default).

Also removes the Fast-LLM roundtrip fixture from test_integration.py since
test_hf_roundtrip.py covers that path more thoroughly.
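The Mixtral `mlp_bias` fixes above follow a common converter pattern: default on import, strip on export. A hedged sketch with illustrative function names (the repo's converter API differs):

```python
def import_mixtral_mlp(hf_config: dict) -> dict:
    # .get with a default instead of direct indexing: checkpoints may omit mlp_bias
    return {"mlp_bias": hf_config.get("mlp_bias", False)}

def export_mixtral_mlp(fast_llm_config: dict) -> dict:
    out = dict(fast_llm_config)
    out.pop("mlp_bias", None)   # MixtralConfig has no mlp_bias field
    return out
```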

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
save_pretrained() always serializes with 4.x naming (language_model.model.*,
vision_tower.*, multi_modal_projector.*) regardless of in-memory attribute
layout changes in 5.x. Remove the version-conditional key prefix logic
in plan.py and the matching conditional in test_convert_from_llava.py,
which used the wrong v5 key (model.language_model.model.*) causing
KeyError on all plan-based tests under transformers 5.x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 4.x has Qwen3NextDynamicCache which requires cache_position kwarg and
  exposes recurrent state at qwen_cache.recurrent_states[0]
- 5.x removed Qwen3NextDynamicCache in favor of DynamicCache(config=...),
  dropped cache_position kwarg, and moved recurrent state to
  qwen_cache.layers[0].recurrent_states

Use _TRANSFORMERS_V4 to branch all three differences.
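The recurrent-state access path, the third of those differences, can be sketched as a branch on the flag (attribute names per the commit message; the cache objects here are duck-typed):

```python
def recurrent_state(cache, transformers_v4: bool):
    if transformers_v4:
        return cache.recurrent_states[0]          # 4.x Qwen3NextDynamicCache layout
    return cache.layers[0].recurrent_states       # 5.x DynamicCache layer layout
```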

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jlamypoirier jlamypoirier marked this pull request as ready for review April 24, 2026 02:48
@jlamypoirier jlamypoirier merged commit 63574be into main Apr 24, 2026
1 of 2 checks passed
@jlamypoirier jlamypoirier deleted the jlp_transformers_v5 branch April 24, 2026 02:49