feature(xjy): Refine PriorZero Implementation by xiongjyu · Pull Request #441 · opendilab/LightZero

xiongjyu · 2025-11-20T15:48:21Z

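This PR refines the PriorZero implementation and development workflow. It fixes several critical issues that affected training correctness and stability, and systematically strengthens the training logic, loss computation, and data collection.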

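Work completed in this PR
• Fixed several critical bugs in the PriorZero training pipeline, including errors in game segment construction, loss computation, log-prob alignment, and action handling.
• Completed the REINFORCE / RFT-style policy optimization implementation: old_logprob is now stored in the buffer and used correctly, so that policy updates are computed correctly.
• Added and standardized training statistics, including KL divergence and policy entropy, to better monitor training status.
• Optimized the data flow between the Collector and the Replay Buffer, improving data consistency and sampling stability and reducing silent errors.
• Introduced and validated a vLLM weight synchronization mechanism for the single-GPU setting.
• vLLM weight synchronization and stability validation for multi-GPU / multi-node settings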
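
To make the role of old_logprob concrete, here is a minimal sketch of a clipped, importance-weighted REINFORCE-style loss that also reports a sample-based KL estimate. It is not taken from the PR: the function name, the clipping rule, `clip_eps`, and the choice of KL estimator are all assumptions for illustration, and it uses plain Python floats instead of tensors.

```python
import math

def reinforce_style_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Illustrative clipped policy-gradient loss (hypothetical helper, not PriorZero's API).

    old_logprobs are the log-probs recorded in the buffer at collection time;
    new_logprobs come from the current policy for the same actions.
    """
    # The two log-prob lists must be aligned action by action.
    assert len(new_logprobs) == len(old_logprobs) == len(advantages)
    losses, kls = [], []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))
        # Non-negative sample estimate of KL(old || new): (r - 1) - log r
        kls.append((ratio - 1.0) - (new_lp - old_lp))
    n = len(losses)
    return sum(losses) / n, sum(kls) / n

loss, approx_kl = reinforce_style_loss(
    new_logprobs=[-1.0, -2.0], old_logprobs=[-1.0, -2.0], advantages=[1.0, -0.5]
)
```

When the current policy equals the collection-time policy, the ratio is 1 for every action, the KL estimate is 0, and the loss reduces to the negative mean advantage. If old_logprob were misaligned with the actions, the ratio would be meaningless, which is why the alignment fixes listed above matter.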


puyuan1996 · 2025-12-15T14:38:21Z

+    for i in range(num_engines):
+        bundle_indices = None
+        if tensor_parallel_size > 1:
+            bundle_indices = get_bundle_indices(shared_pg, i, tensor_parallel_size)


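Is this based on Ray's official improvements?

This vllm_engine is essentially the same as the corresponding part of OpenRLHF. For now, though, we use only a single vLLM engine with tensor_parallel_size = 1, since GPU memory is sufficient.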
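
For readers unfamiliar with the loop in the diff, the sketch below shows one way bundle indices could be assigned per engine. It is a simplified stand-in under stated assumptions: `contiguous_bundle_indices` and `plan_engines` are hypothetical names, the real `get_bundle_indices` receives the shared placement group and may take node placement into account, and no Ray calls are made here.

```python
def contiguous_bundle_indices(engine_index, tensor_parallel_size):
    """Hypothetical stand-in: engine i takes the i-th contiguous block of bundles."""
    start = engine_index * tensor_parallel_size
    return list(range(start, start + tensor_parallel_size))

def plan_engines(num_engines, tensor_parallel_size):
    """Mirror the loop in the diff: only assign bundles when TP > 1."""
    plan = []
    for i in range(num_engines):
        bundle_indices = None
        if tensor_parallel_size > 1:
            bundle_indices = contiguous_bundle_indices(i, tensor_parallel_size)
        plan.append(bundle_indices)
    return plan

single_engine = plan_engines(num_engines=1, tensor_parallel_size=1)
two_tp_engines = plan_engines(num_engines=2, tensor_parallel_size=2)
```

With the configuration described in the reply (one engine, tensor_parallel_size = 1), `bundle_indices` stays `None` and the branch is never taken.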


…evels, per-level eval logging
- Switch from single-task to multi-task training/eval across all 40 BabyAI levels
- Add per-level TensorBoard logging in evaluator (WM+LLMPrior and LLMPrior modes)
- Run initial evaluation before training loop starts
- Align hyperparams: max_steps=20, prompt_max_len=512, model=qwen2.5-7b
- Increase eval intervals (wm: 2000, llm: 200) for 40-level multi-task
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…Inter-RL levels
- Add 1D→2D unsqueeze guard in tokenizer and HFLanguageRepresentationNetwork to prevent BERT ValueError when batch dim is missing during evaluation
- Correct task list from 40 to 18 levels based on HF AgentGym-RL-Data-ID dataset (levels: 1-11, 19-21, 30-31, 33, 36)
- Increase evaluator_env_num from 2 to 8 for ~4x faster multi-task eval
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ajectory dump
- Add _should_continue_collect + drain_vllm_iter to PriorZeroCollector to fix the vLLM-TP deadlock that hung _sync_prompts_for_tp when DDP ranks finished episodes at different step counts (mirrors the existing eval fix)
- Add _sync_prompts_for_tp / drain_vllm_iter in DataProcessor: all_gather prompts over the TP subgroup so partners submit matched vllm.generate calls; required for any vllm_tensor_parallel_size > 1 with DDP > 1
- Introduce setup_priorzero_logging with priorzero.main/train/eval loggers (file + rank-0 console, NullHandler elsewhere); replace scattered loguru/print calls in entry, trainer, datafactory, vllm worker
- Evaluator: dist.barrier between WM and WM_LLMPrior eval, save per-episode eval trajectories as JSON, extend per-level TB stats with mean/std/min/max
- MuZero evaluator: add total_finishes hard counter to prevent eval hang when episodes are unevenly distributed across envs
- Guards: empty-batch return in UniZeroPolicy._forward_eval, zero-length input return in HFLanguageRepresentationNetwork
- BabyAI: bump prompt_max_len 512→4096 to fit obs budget, align evaluator_env_num with eval level count
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
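
The deadlock described in the commit above comes from collective calls that only some ranks reach. The following toy simulation sketches the general "drain" idea under assumptions: `simulate_collect` is a hypothetical name, the strings `"generate"` and `"drain"` are placeholders for a real and a dummy submission, and the `any(...)` check stands in for whatever collective "should anyone continue?" test the collector actually uses.

```python
def simulate_collect(steps_per_rank):
    """Return, for each round, what every rank submits.

    steps_per_rank[i] is how many real generation steps rank i still needs.
    """
    remaining = list(steps_per_rank)
    schedule = []
    # Stand-in for a collective "does any rank still need to collect?" check.
    while any(r > 0 for r in remaining):
        round_calls = []
        for rank, r in enumerate(remaining):
            if r > 0:
                round_calls.append("generate")  # real prompts
                remaining[rank] -= 1
            else:
                round_calls.append("drain")     # dummy call to stay in lockstep
        schedule.append(round_calls)
    return schedule

schedule = simulate_collect([3, 1])
```

Every round contains one entry per rank, so tensor-parallel partners always issue the same number of calls even when one rank finishes its episodes early.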

…size floor, Fix dual CUDA OOM in train→vLLM handoff and large-batch forward pass
- Remove `max(micro_train_batch_size, 32)` floor in PolicyModel.forward and ReferenceModel.forward; use micro_train_batch_size directly (2 vs 32), reducing per-chunk logits from ~37 GiB to ~2.3 GiB for Qwen2.5-7B
- Remove redundant batch-level .to(device) before chunking loop to avoid duplicating full batch on GPU alongside per-chunk slices
- Reduce default micro_train_batch_size from 4 to 2 for 7B model headroom
- Reorder offload_states() before _broadcast_to_vllm() in train_batch so DeepSpeed optimizer states are freed to CPU before vLLM wake_up() reclaims ~43 GiB for weights+KV cache (was OOM at cumem_allocator.cpp:62)
- Defer logits.to(float32) to training path only (return_entropy=True), avoiding a 42 GiB fp32 allocation during old_action_log_probs forward; bf16 path in log_probs_from_logits already handles numerical stability
- Reduce micro_train_batch_size 4→2 in BabyAI config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
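
The per-chunk logits figures quoted in the commit above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes bf16 logits (2 bytes per element), a sequence length of 4096 tokens, and a vocabulary size of 152064 for Qwen2.5-7B; the sequence length and vocabulary size are assumptions chosen for illustration, not values stated in the commit.

```python
GIB = 1024 ** 3

def logits_gib(micro_batch, seq_len, vocab_size, bytes_per_elem):
    """Memory of one [micro_batch, seq_len, vocab_size] logits tensor, in GiB."""
    return micro_batch * seq_len * vocab_size * bytes_per_elem / GIB

VOCAB_SIZE = 152064  # assumed Qwen2.5-7B vocab size; check the model's config.json
SEQ_LEN = 4096       # assumed sequence length
BF16_BYTES = 2

with_floor = logits_gib(32, SEQ_LEN, VOCAB_SIZE, BF16_BYTES)    # old max(..., 32) floor
without_floor = logits_gib(2, SEQ_LEN, VOCAB_SIZE, BF16_BYTES)  # micro_train_batch_size = 2
print(f"with floor: {with_floor:.1f} GiB, without floor: {without_floor:.1f} GiB")
```

Under these assumptions the two values come out near 37 GiB and 2.3 GiB, a 16x reduction that follows directly from the chunk size dropping from 32 to 2.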

…aseline config
- Fix eval_only_llm_prior episode refill: with env_num=4 and n_episode=18, finished envs were never re-added, so only 4 of 18 levels were evaluated. Add refill logic mirroring eval_with_llm_prior to cover all levels.
- Reduce format_weight 0.5→0.1 to prevent constant-positive advantage bias from always-correct format rewards causing response length collapse (88→10)
- Increase rft_kl_coef 0.001→0.01 and entropy_loss_coef 0→0.01 to anchor policy against drift and entropy collapse
- Accept **kwargs in GameSegment.append() so env-specific fields like raw_obs_text pass through without TypeError
- Add BabyAI UniZero baseline config (no LLM) with WM hyperparameters aligned to PriorZero for fair ablation comparison
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…or, harden RFT training
- Refactor PriorZero evaluator TB logging: unified x-axis (env_step) for all three eval modes (wm_only, wm_llm, llm_only) with new hierarchical tags eval/{mode}/agg/* and eval/{mode}/per_level/*; old tags preserved under deprecated/ prefix for transition
- Add env_step-based eval frequency (wm_eval_freq_envsteps, llm_eval_freq_envsteps) to both BabyAI and Jericho configs, with iter-based fallback when set to 0
- Remove phase guard on eval_wm_only and eval_llm_only so all three modes run in every phase, enabling consistent cross-phase comparison
- Add eval_wm_only() method to PriorZero evaluator with per-level tracking (previously delegated to super().eval() without level info)
- Create MuZeroPerLevelEvaluator for UniZero baseline with dual x-axis TB logging (env_step primary, train_iter secondary) and per-level reward breakdown matching PriorZero tag structure
- Add KL early stopping in BatchPPOTrainer: skip remaining gradient updates when ref_kl exceeds kl_early_stop_threshold (default 0 = disabled)
- Replace hard assert with warning + filter in DataProcessor for tokenizer round-trip mismatches to avoid training crashes
- Tune BabyAI RFT hyperparams: format_weight 0.1→0.3, rft_kl_coef 0.01→0.1, entropy_loss_coef 0.01→0.001, replay_buffer 300k→500k, collect_num_simulations 50→25
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
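
The KL early-stopping rule mentioned in the commit above can be sketched as a small control loop. This is an illustration under assumptions, not the BatchPPOTrainer code: `run_updates` is a hypothetical name, the KL values are passed in as a precomputed list, and the sketch assumes the KL is checked before each update is applied.

```python
def run_updates(ref_kls, kl_early_stop_threshold=0.0):
    """Count how many gradient updates are applied before KL early stopping triggers.

    ref_kls[i] is the reference KL measured just before update i.
    A threshold of 0 disables early stopping, matching the stated default.
    """
    applied = 0
    for ref_kl in ref_kls:
        if kl_early_stop_threshold > 0 and ref_kl > kl_early_stop_threshold:
            break  # skip this and all remaining updates in the batch
        applied += 1
    return applied

kls = [0.01, 0.02, 0.5, 0.01]
stopped_early = run_updates(kls, kl_early_stop_threshold=0.1)
never_stops = run_updates(kls, kl_early_stop_threshold=0.0)
```

With the threshold at 0.1 the loop stops at the third measurement; with the default of 0 every update runs regardless of the KL values.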


xiongjyu added 5 commits November 20, 2025 12:48

Fix game_segment/weighted_total_loss bugs and refine prompts, compute…

a3a2d69

…_llm_prior, and SFT loss

Fixed the accumulate_steps bug and added cprofile functionality.

959a558

Refine the code and fix the bug in data collection.

ecedc5f

Add REINFORCE-style losses and store old_logprob in the buffer.

2d53d22

Fix the get_llm_prior bug so that every action receives a logprob

c608600

xiongjyu commented Nov 24, 2025


Comment thread zoo/jericho/priorzero/priorzero_policy.py Outdated

puyuan1996 reviewed Nov 24, 2025


Comment thread zoo/jericho/priorzero/priorzero_policy.py Outdated

fixed the history bug in the build_llm_prompt and logs in forward_learn

15e39f6

xiongjyu deleted the branch opendilab:dev-multitask-balance-clean-rft November 24, 2025 14:28

xiongjyu closed this Nov 24, 2025

xiongjyu deleted the dev-multitask-balance-clean-rft branch November 24, 2025 14:28

xiongjyu reopened this Nov 24, 2025

xiongjyu added 4 commits November 24, 2025 22:35

rename advantage_tensor on rft

7c9acd9

Fixed the action out-of-bounds bug and added a record for forward_col…

738f300

…lect to cprofile.

Fixed the misalignment between old_log_prob and log_prob, and correct…

0a166f6

…ed the REINFORCE-series loss computation.

add some logs for analysis

4f3668e

puyuan1996 added the research Research work in progress label Nov 28, 2025

xiongjyu added 10 commits November 30, 2025 01:47

Polish the code and standardize the format.

2985e60

Add KL divergence in rft and llm_prior_entropy in collect

ff98006

polish config and format

7e43e45

delete unused files

d6555e5

Decouple the training of world_model and LLM.

b7d42ee

add cache in the jericho

95e2347

Separate sync and async entry points to simplify the program.

9682486

Reference OpenRLHF’s implementation to update vLLM weights in real ti…

0a38197

…me. Single-GPU works; multi-GPU not tested yet.

delete unused orz files

e361039

fix a small bug

f957db9

puyuan1996 reviewed Dec 15, 2025


Comment thread zoo/jericho/priorzero/vllm_utils/vllm_engine_ray.py Outdated

Fix action='go' bug; optimize replay buffer with larger capacity; sam…

628d7d2

…ple for world-model training; train LLM only on latest trajectories

xiongjyu and others added 30 commits April 4, 2026 17:52

fix a bug

d040b93

fix timeout bug when running zork1

9095783

tmp

09ddd49

fix(pu): fix jericho c timeout bug

a6ae249

fix the bug due to zork1

888ae2a

add llm_no_collect mode

caed1c1

docs(priorzero): add improvement directions for vision environments such as Atari

6f30b73

Merge pull request #10 from xiongjyu/codex/-zoo/jericho/priorzero-readme

c575dfc

Docs: Expand PriorZero (Jericho) README with detailed configuration and usage

delete unused files

0b93856

Merge remote-tracking branch 'origin-xjy/dev-multitask-balance-clean-…

40c7d48

…rft' into dev-multitask-balance-clean-rft-vl

polish(pu): polish lunarlander-image uz config

4d50c4d

polish(pu): polish lunarlander-image related code

02ea915

tmp

dad9df9

feature(pu): add babyai env and related priorzero configs

e318dcc

fix(pu): fix babyai return

b51efff

feature(pu): add textcraft env and configs

58a9824

add llm as policy and rlft ablations

68ae363

add qwen as backbone

6793702

tmp

2309ab6

fix(pu): fix babyai_env.py

35d4c67

Merge branch 'dev-multitask-balance-clean-rft' into dev-multitask-bal…

e5648f9

…ance-clean-rft-babyai-textcraft

Merge pull request #11 from opendilab/dev-multitask-balance-clean-rft…

d3b2b60

…-babyai-textcraft feature(pu): add babyai and textcraft env and configs for priorzero

Merge branch 'dev-multitask-balance-clean-rft' into dev-multitask-bal…

c986457

…ance-clean-rft-vl

Merge pull request #8 from opendilab/dev-multitask-balance-clean-rft-vl

e17d3ee

feature(pu): add image-based/vlm version of priorzero

xiongjyu wants to merge 187 commits into opendilab:dev-multitask-balance-clean-rft from xiongjyu:dev-multitask-balance-clean-rft

2 participants
