
Support transformers v5 #481

Merged
jlamypoirier merged 20 commits into main from jlp_transformers_v5
Apr 24, 2026

Conversation

Collaborator

@jlamypoirier jlamypoirier commented Apr 9, 2026

Summary

  • Support transformers v5 alongside v4 in Fast-LLM core and external models (_TRANSFORMERS_V4 flag)
  • Fix MTP Llama and Apriel2 external model compatibility with v5 API changes
  • Fix LLaVA plan.py: checkpoint keys are identical in v4 and v5 (save_pretrained always uses v4 naming); remove incorrect version-conditional key prefix
  • Fix TestGDNEquivalence: branch Qwen3NextDynamicCache vs DynamicCache, cache_position kwarg, and recurrent state access path between v4 and v5
  • Fix converter bugs found by new HF roundtrip tests: Mixtral mlp_bias injection, Qwen2 head_dim export, Apriel2 config omissions
  • Add HF → Fast-LLM → HF roundtrip tests for llama, mistral, qwen2, mixtral, mtp_llama, apriel2_attention, apriel2_supernet
  • Pin transformers to v5 (>=5.0.0,<6.0.0) in Dockerfile
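The `_TRANSFORMERS_V4` flag mentioned above gates all v4 compatibility code. As a minimal sketch of what such a flag could look like (the repo's actual detection mechanism is not shown in this PR; the function name here is illustrative):

```python
def is_transformers_v4(version: str) -> bool:
    # True for any transformers release below 5.0.0, e.g. "4.57.3"
    return int(version.split(".")[0]) < 5
```

In the codebase this would typically be evaluated once against `transformers.__version__` and imported everywhere else, so dropping v4 support later is a single grep-and-delete.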

jlamypoirier and others added 12 commits April 22, 2026 20:34
- Widen transformers version constraint to >=4.57.3,<6.0.0
- Version-gate PretrainedConfig init (__init__ vs __post_init__) and dtype attribute (torch_dtype vs dtype) using dataclasses.is_dataclass detection
- Fall back to transformers.modeling_utils.no_init_weights for 4.x
- Support both rope_parameters (5.x) and rope_theta/rope_scaling (4.x) in Llama import/export config
- Handle both attribute paths for vision_tower in multimodal HF model test
- Fix mtp_llama LlamaRotaryEmbedding to handle both rope config formats
- Add _gdn_fla_available and _kda_fla_available flags to apriel2; use them to properly skip backup SSM tests when fla kernels are absent
- Update CLAUDE.md with redirect-to-file and external model test guidance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…compatibility

- apriel2/modeling_apriel2.py: add _TRANSFORMERS_V5 flag; fix _tied_weights_keys
  to dict format for 5.x (list for 4.x); add rope_parameters to PixtralRotaryEmbedding
  SimpleNamespace config
- mtp_llama/modeling_mtp_llama.py: add _TRANSFORMERS_V5 flag; fix _tied_weights_keys
- apriel2/conversion/llava/config.py: handle 5.x rope_parameters dict in text and
  vision configs alongside 4.x rope_theta
- apriel2/conversion/llava/plan.py: version-conditional source weight key prefixes
  (5.x LlavaForConditionalGeneration adds model. prefix to submodules)
- test_cache_contracts.py: update DynamicLayer.get_mask_sizes calls to pass int in 5.x
  (query_length) vs tensor in 4.x; update sdpa_mask signature for 5.x (q_length/q_offset)
- test_convert_from_llava.py: use version-conditional embed_tokens source key
- test_equivalence.py: fix get_image_features handling — 5.x returns BaseModelOutput
  with projected features in pooler_output (not last_hidden_state)
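The `_tied_weights_keys` format change described above (list in 4.x, dict in 5.x) can be sketched like this; the specific key names are illustrative, standing in for the repo's lm-head/embedding tie:

```python
def tied_weights_keys(transformers_v4: bool):
    if transformers_v4:
        return ["lm_head.weight"]                            # 4.x: flat list of tied keys
    return {"lm_head.weight": "model.embed_tokens.weight"}   # 5.x: target -> source mapping
```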

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix num_blocks off-by-one in import_config (was subtracting 1)
- Fix num_hidden_layers off-by-one in export_config (was adding 1)
- Fix mtp_heads index off-by-one in get_converters (was prediction_distance - 1)
- Fix hidden state collection order in MTPLlamaModel: add embedding before
  trunk loop and add trunk layer outputs inside the loop, consistent with
  standard transformers @capture_outputs behavior
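The corrected collection order (embedding output first, then each trunk layer's output inside the loop) can be shown with a toy version; `embed` and `trunk` here are arbitrary callables, not the real model components:

```python
def collect_hidden_states(embed, trunk, x):
    h = embed(x)
    states = [h]              # embedding output recorded before the trunk loop
    for layer in trunk:
        h = layer(h)
        states.append(h)      # each trunk layer's output recorded inside the loop
    return h, states
```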

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update TOKENIZER_NAME from "bigcode/santacoder" to "gpt2" and update all
hardcoded token values in data tests to match the gpt2 vocabulary.
Also fix deprecated huggingface_hub.HfFolder.get_token() → huggingface_hub.get_token().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ERS_V4

- Deduplicate rope-type dispatch in LlamaAttentionConverter.import_config by
  normalizing rope_params/rope_theta from either checkpoint format first
- Rename _TRANSFORMERS_V5 → _TRANSFORMERS_V4 (inverted flag) so v4 compat
  code is in `if _TRANSFORMERS_V4:` blocks — grep-and-delete to drop v4
- Flip all if/else so v5 code is the default path and v4 is the guarded branch
- Import _TRANSFORMERS_V4 from config.py in huggingface.py; replace try/except
  with explicit if/else
- Add comments for v5 changes that can't use the flag (TYPE_CHECKING guard,
  checkpoint format detection, model.model structure)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use tuple prefixes unpacked into W(...) instead of the / operator,
keeping the _TRANSFORMERS_V4 branching for the path prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Keep llava_layer/apriel_layer intermediate variables (with / operator)
in loops; only the layer root W() calls use *prefix unpacking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- apriel2: override tie_weights() in Apriel2PreTrainedModel to recompute
  MistralRotaryEmbedding.inv_freq after v5 meta-device loading zeroes it
  (non-persistent buffers not in checkpoint are materialized as zeros)
- apriel2: hardcode _attn_implementation="eager" in preprocess mask config
  so an explicit float mask is always built (v5 sdpa returns None otherwise)
- apriel2: use version-conditional kwarg name for create_causal_mask
  (inputs_embeds in v5, input_embeds in v4)
- apriel2 test: compare only non-padding positions in test_logits_match
- mtp_llama: update LlamaRotaryEmbedding and config for v5 compatibility
- gpt/huggingface: remove dead code line
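The `inv_freq` buffer that `tie_weights()` recomputes is the standard RoPE inverse-frequency vector. A pure-Python sketch of that computation (the real code presumably builds a torch tensor):

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    # inverse frequencies of rotary embeddings: base^(-2i/d) for i in [0, d/2)
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
```

Recomputing it at tie time is needed because, as noted above, v5 meta-device loading materializes non-persistent buffers absent from the checkpoint as zeros.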

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This reverts commit 39196c6.
jlamypoirier and others added 7 commits April 22, 2026 23:57
- Replace removed Qwen3NextDynamicCache with DynamicCache; remove dropped
  cache_position kwarg from GDN forward; fix recurrent_states access path
- Fix rope_theta KeyError (moved to rope_parameters dict in v5)
- Fix attention_mask device mismatch in integration test
- Expand+contiguous attn mask before SDPA to satisfy CUDA kernel contiguity
- Use model._attn_implementation for causal mask creation so pure-causal
  inputs get None mask and SDPA uses is_causal=True (matching Qwen2 numerics)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… usage

supernet_path and fastllm_path are removed as soon as they've been consumed,
so only one checkpoint directory exists at a time instead of all three.
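The delete-as-you-go pattern described above can be sketched as a generic conversion chain; `steps` are arbitrary callables returning a new output directory, not the repo's actual converters:

```python
import shutil

def convert_chain(src_dir: str, steps) -> str:
    # run each conversion step, deleting the consumed input directory
    # immediately so at most one intermediate checkpoint exists at a time
    current = src_dir
    for step in steps:
        nxt = step(current)
        shutil.rmtree(current)
        current = nxt
    return current
```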

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/models/test_hf_roundtrip.py covering HF → Fast-LLM → HF
roundtrip for llama, mistral, qwen2, mixtral, mtp_llama, apriel2_attention,
and apriel2_supernet. Tests download only the HF config (no weights), build
a tiny model, convert both ways, and compare config and weights exactly.
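Exact config comparison of the kind these tests perform can be sketched with a small diff helper (a toy, not the tests' actual comparison code):

```python
def config_diff(a: dict, b: dict) -> dict:
    # keys whose values differ between the exported and re-imported configs;
    # an empty result means the roundtrip preserved the config exactly
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}
```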

Converter fixes found by the tests:
- mixtral: MixtralMLPConverter.import_config was missing `mlp_bias = False`
  injection (KeyError); export_config was leaking `mlp_bias` into the output
  dict (MixtralConfig has no such field).
- qwen2: export_config was writing `head_dim` (not a standard Qwen2Config
  field); added constraint check that heads * head_size == hidden_size on
  export; added assertion that use_mrope is not True on import.
- apriel2: omit window_size when None; omit add_linear_biases when True
  (default); omit sampling_strategy when uniform (default).

Also removes the Fast-LLM roundtrip fixture from test_integration.py since
test_hf_roundtrip.py covers that path more thoroughly.
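The Mixtral `mlp_bias` fixes above follow a common converter pattern: default on import, strip on export. A hedged sketch with illustrative function names (the repo's converter API differs):

```python
def import_mixtral_mlp(hf_config: dict) -> dict:
    # .get with a default instead of direct indexing: checkpoints may omit mlp_bias
    return {"mlp_bias": hf_config.get("mlp_bias", False)}

def export_mixtral_mlp(fast_llm_config: dict) -> dict:
    out = dict(fast_llm_config)
    out.pop("mlp_bias", None)   # MixtralConfig has no mlp_bias field
    return out
```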

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
save_pretrained() always serializes with 4.x naming (language_model.model.*,
vision_tower.*, multi_modal_projector.*) regardless of in-memory attribute
layout changes in 5.x. Remove the version-conditional key prefix logic
in plan.py and the matching conditional in test_convert_from_llava.py,
which used the wrong v5 key (model.language_model.model.*) causing
KeyError on all plan-based tests under transformers 5.x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 4.x has Qwen3NextDynamicCache which requires cache_position kwarg and
  exposes recurrent state at qwen_cache.recurrent_states[0]
- 5.x removed Qwen3NextDynamicCache in favor of DynamicCache(config=...),
  dropped cache_position kwarg, and moved recurrent state to
  qwen_cache.layers[0].recurrent_states

Use _TRANSFORMERS_V4 to branch all three differences.
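The recurrent-state access path, the third of those differences, can be sketched as a branch on the flag (attribute names per the commit message; the cache objects here are duck-typed):

```python
def recurrent_state(cache, transformers_v4: bool):
    if transformers_v4:
        return cache.recurrent_states[0]          # 4.x Qwen3NextDynamicCache layout
    return cache.layers[0].recurrent_states       # 5.x DynamicCache layer layout
```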

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jlamypoirier jlamypoirier marked this pull request as ready for review April 24, 2026 02:48
@jlamypoirier jlamypoirier merged commit 63574be into main Apr 24, 2026
1 of 2 checks passed
@jlamypoirier jlamypoirier deleted the jlp_transformers_v5 branch April 24, 2026 02:49