fix: preserve multimodal image token counts by Hyperion-shuo · Pull Request #1900 · InternLM/xtuner

Hyperion-shuo · 2026-06-10T05:45:14Z

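Background

In Qwen3-VL / multimodal RL training, pixel_values and image_grid_thw already reach the training worker, but num_img_tokens is not fully propagated to the SequenceContext on the training side.

As a result, even when a real multimodal batch clearly contains images, the training log reports: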

step_consumed_img_tokens=0
img_efficient_attn_ratio=0
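
The metric code itself is not shown in this PR, so the following is only a minimal sketch of one way an image-token counter can silently read zero. It assumes the counter sums per-image token counts and treats a missing field as "no images". The function name `consumed_img_tokens` and the sample values are hypothetical, not xtuner's API.

```python
from typing import Optional, Sequence


def consumed_img_tokens(per_sample_counts: Sequence[Optional[Sequence[int]]]) -> int:
    """Sum image-token counts over one step.

    A sample whose counts are None (the field never reached the trainer)
    contributes nothing, so the total silently stays at zero.
    """
    total = 0
    for counts in per_sample_counts:
        if counts is None:
            continue
        total += sum(counts)
    return total


# Hypothetical batch where the field was dropped before reaching the trainer.
dropped = [None, None]
# Same batch with the field preserved: the first sample has two images.
preserved = [[64, 256], [128]]
```

Under this assumption the images are still present in pixel_values, so training proceeds, but the logged count carries no information about them.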

Changes

This PR completes the propagation of num_img_tokens through the RL multimodal training pipeline:

Add num_img_tokens to MultimodalInfo
In RLQwen3VLTokenizeFunction, write the num_img_tokens produced by the tokenizer into mm_info
Preserve num_img_tokens when packing SequenceContext
Restore num_img_tokens when RLColocateTrainer builds the training SequenceContext
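
The four steps above can be sketched with simplified stand-ins. This is an illustration under stated assumptions, not the code in the diff: `MMInfo`, `SeqCtx`, `tokenize`, and `build_ctx` are hypothetical names that mirror MultimodalInfo, SequenceContext, the tokenize function, and the trainer's context construction, and the real classes carry many more fields.

```python
from dataclasses import dataclass
from typing import List, Optional, TypedDict


class MMInfo(TypedDict, total=False):
    # Hypothetical stand-in for MultimodalInfo; only the new field is shown.
    num_img_tokens: List[int]


@dataclass
class SeqCtx:
    # Hypothetical stand-in for SequenceContext.
    num_tokens: int
    num_img_tokens: Optional[List[int]] = None

    @classmethod
    def pack(cls, ctxs: List["SeqCtx"]) -> "SeqCtx":
        # Concatenate per-image counts from every packed sequence.
        # The result stays None only if no member carried the field.
        merged: List[int] = []
        seen = False
        for c in ctxs:
            if c.num_img_tokens is not None:
                merged.extend(c.num_img_tokens)
                seen = True
        return cls(
            num_tokens=sum(c.num_tokens for c in ctxs),
            num_img_tokens=merged if seen else None,
        )


def tokenize(tokenizer_out: dict) -> MMInfo:
    # Copy the tokenizer's count into mm_info instead of dropping it.
    mm_info: MMInfo = {}
    if "num_img_tokens" in tokenizer_out:
        mm_info["num_img_tokens"] = list(tokenizer_out["num_img_tokens"])
    return mm_info


def build_ctx(num_tokens: int, mm_info: MMInfo) -> SeqCtx:
    # Restore the count when the training-side context is built.
    return SeqCtx(num_tokens=num_tokens, num_img_tokens=mm_info.get("num_img_tokens"))
```

For example, packing `build_ctx(300, tokenize({"num_img_tokens": [64, 128]}))` with a text-only `build_ctx(50, tokenize({}))` yields a context whose num_img_tokens is `[64, 128]`. Keeping the field optional lets text-only batches pass through unchanged.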

Before the fix: the multimodal payload reaches training, but step_consumed_img_tokens=0
After the fix: step_consumed_img_tokens is non-zero, and training and backward complete normally

Co-authored-by: shenshuo <shenshuo@pjlab.org.cn>

fix: preserve multimodal image token counts

7fb8da3

hhaAndroid approved these changes Jun 10, 2026


hhaAndroid merged commit 8976701 into InternLM:main Jun 10, 2026
7 of 8 checks passed

braisedpork1964 pushed a commit to braisedpork1964/xtuner that referenced this pull request Jun 11, 2026

fix: preserve multimodal image token counts (InternLM#1900)

45694e3

Co-authored-by: shenshuo <shenshuo@pjlab.org.cn>

hhaAndroid merged 1 commit into InternLM:main from Hyperion-shuo:ss/fix-rl-mm-image-token-accounting


2 participants
