
fix: preserve multimodal image token counts #1900

Merged
hhaAndroid merged 1 commit into InternLM:main from Hyperion-shuo:ss/fix-rl-mm-image-token-accounting on Jun 10, 2026



Hyperion-shuo (Contributor) commented Jun 10, 2026

Background

In Qwen3-VL / multimodal RL training, pixel_values and image_grid_thw already reach the training worker, but num_img_tokens is not fully propagated to the training-side SequenceContext.

As a result, real multimodal batches contain images, yet the training log reports:

step_consumed_img_tokens=0
img_efficient_attn_ratio=0
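The sketch below shows why a dropped field produces a silent zero instead of an error. It rests on two assumptions. First, SequenceContext here is a hypothetical stand-in that models only num_img_tokens. Second, the step_consumed_img_tokens helper is a hypothetical aggregator that treats a missing value as contributing nothing. It is not xtuner's actual metric code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SequenceContext:
    # Hypothetical stand-in for the training-side context; only the field
    # relevant to this PR is modeled.
    num_img_tokens: Optional[List[int]] = None


def step_consumed_img_tokens(contexts: List[SequenceContext]) -> int:
    # Assumed behavior: a missing num_img_tokens contributes nothing,
    # so a field dropped upstream reports 0 instead of raising.
    return sum(sum(ctx.num_img_tokens or []) for ctx in contexts)


dropped = [SequenceContext(num_img_tokens=None)]
kept = [SequenceContext(num_img_tokens=[256, 64])]
print(step_consumed_img_tokens(dropped))  # 0
print(step_consumed_img_tokens(kept))     # 320
```

This silent default is what makes the bug easy to miss. The forward and backward passes still run, and only the accounting metrics drop to zero.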

Changes

This PR completes the propagation of num_img_tokens through the RL multimodal training pipeline:

Add num_img_tokens to MultimodalInfo
In RLQwen3VLTokenizeFunction, write the tokenizer's num_img_tokens output into mm_info
Preserve num_img_tokens when packing SequenceContext
Restore num_img_tokens when RLColocateTrainer builds the training SequenceContext
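The four steps above can be sketched as a single data path. This is a simplified model, not the PR's code. The dataclasses and the functions tokenize, pack, and build_train_context are hypothetical stand-ins for MultimodalInfo, RLQwen3VLTokenizeFunction, SequenceContext packing, and RLColocateTrainer. It is also assumed that num_img_tokens is a per-image list of counts.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MultimodalInfo:
    # Hypothetical mirror of MultimodalInfo; num_img_tokens is the new field,
    # the others stand in for the payload that already reached training.
    pixel_values: Any = None
    image_grid_thw: Any = None
    num_img_tokens: List[int] = field(default_factory=list)


@dataclass
class SequenceContext:
    # Hypothetical training-side context; only the relevant field is modeled.
    num_img_tokens: Optional[List[int]] = None


def tokenize(tokenizer_output: Dict[str, Any]) -> MultimodalInfo:
    # Step 2: copy num_img_tokens from the tokenizer output into mm_info.
    return MultimodalInfo(
        pixel_values=tokenizer_output.get("pixel_values"),
        image_grid_thw=tokenizer_output.get("image_grid_thw"),
        num_img_tokens=list(tokenizer_output.get("num_img_tokens", [])),
    )


def pack(infos: List[MultimodalInfo]) -> Dict[str, Any]:
    # Step 3: when packing several samples, concatenate the per-image counts
    # instead of dropping them (only the num_img_tokens path is modeled).
    packed: Dict[str, Any] = {"num_img_tokens": []}
    for info in infos:
        packed["num_img_tokens"].extend(info.num_img_tokens)
    return packed


def build_train_context(packed: Dict[str, Any]) -> SequenceContext:
    # Step 4: restore the field when the trainer rebuilds the context.
    return SequenceContext(num_img_tokens=packed.get("num_img_tokens"))


samples = [
    tokenize({"num_img_tokens": [256]}),
    tokenize({"num_img_tokens": [64, 64]}),
]
ctx = build_train_context(pack(samples))
print(ctx.num_img_tokens)  # [256, 64, 64]
```

The field must survive every hop. Omitting it at any single step (tokenize, pack, or context rebuild) yields None or an empty list downstream, which reproduces the zero metrics described above.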

Before the fix: the multimodal payload reaches training, but step_consumed_img_tokens=0
After the fix: step_consumed_img_tokens is nonzero, and training and backward complete normally


hhaAndroid merged commit 8976701 into InternLM:main on Jun 10, 2026
7 of 8 checks passed
braisedpork1964 pushed a commit to braisedpork1964/xtuner that referenced this pull request Jun 11, 2026
Co-authored-by: shenshuo <shenshuo@pjlab.org.cn>