[BugFix] [RL] Fix cpu cache for rl #7764
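Thanks for your contribution!
The CI report is generated from the following code (refreshed every 30 minutes):
1 Task overview
CI has 1 failed Required task; another 6 Required tasks are still running.
2 Task status summary
2.1 Required tasks: 3/10 passed
2.2 Optional tasks: 23/26 passed
3 Failure details (Required only)
Approval: environment issue (confidence: high)
Root cause details: Key logs: Fix suggestion:
Fix suggestion summary: ask a relevant Maintainer to approve this PR. Link: view logs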
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-09 22:53:20
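📋 Review Summary
PR overview: Fixes two bugs in the RL scenario: the GPU cache cleanup did not cover the else branch, and the create_cache_tensor condition was inverted.
Change scope: fastdeploy/cache_manager/cache_transfer_manager.py, fastdeploy/worker/gpu_model_runner.py
Impact tags: [KVCache] [RL]
📝 PR Convention Check
The PR structure is complete, but two points are non-compliant: (1) the title carries two tags, [BugFix] and [RL], whereas §D1 requires the title to contain exactly one official tag; (2) "Add unit tests" is unchecked in the Checklist, and the PR body does not explain why unit tests were skipped.
Suggested title (can be copied directly):
[BugFix] Fix RL cache cleanup and CUDA IPC sync condition bugs
Suggested PR description (can be copied directly):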
## Motivation
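There are two bugs in the RL scenario: 1) in cache_transfer_manager._clear_gpu_cache, the GPU cache cleanup sits only inside the if create_cache_tensor branch, so the else branch skips it and the GPU cache is not released correctly; 2) in gpu_model_runner.clear_cache, the create_cache_tensor condition is inverted (if not create_cache_tensor should be if create_cache_tensor), so the CUDA IPC wait logic takes the wrong branch and CPU cache cleanup fails.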
## Modifications
- fastdeploy/cache_manager/cache_transfer_manager.py: move the cleanup operations gpu_cache_kvs.clear(), gpu_cache_k_tensors.clear(), gpu_cache_v_tensors.clear(), and paddle.device.cuda.empty_cache() from inside the if create_cache_tensor block to after the if/else block, so that both branches perform GPU cache cleanup; restrict the cache_ready_signal reset to the if not create_cache_tensor case
- fastdeploy/worker/gpu_model_runner.py: change if not create_cache_tensor to if create_cache_tensor, fixing the condition of the CUDA IPC wait logic; add an outer if not profile guard so that the synchronization is skipped in profiling mode; add a log message when the wait completes
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py:3013 | Does the clear_cache fix need to be synced to the other hardware ModelRunners? |
| ❓ Question | fastdeploy/cache_manager/cache_transfer_manager.py:562 | When create_cache_tensor=True, both sides wait for signal==0; please confirm there is no deadlock |
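Overall assessment
Both bug fixes are logically clear, and the code changes match the PR description. Please add a note on whether the other hardware runners have been synced, and clarify the lifecycle of the signal in the create_cache_tensor=True branch to confirm that there is no deadlock risk.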
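To make the first fix concrete, here is a minimal sketch of the "cleanup after the if/else" shape described in the summary. It is not the FastDeploy implementation: `TransferManagerSketch`, `_release_device_memory`, and the plain containers are hypothetical stand-ins, and the branch-specific work in the create_cache_tensor=True branch is deliberately left out.

```python
class TransferManagerSketch:
    """Hypothetical stand-in for the cache transfer manager (not FastDeploy code)."""

    def __init__(self, create_cache_tensor):
        self.create_cache_tensor = create_cache_tensor
        # Plain containers stand in for the real GPU tensors.
        self.gpu_cache_kvs = {"key_caches_0": object(), "value_caches_0": object()}
        self.gpu_cache_k_tensors = [object()]
        self.gpu_cache_v_tensors = [object()]
        self.cache_ready_signal = [1]  # one slot per rank
        self.released = False

    def _release_device_memory(self):
        # Placeholder for paddle.device.cuda.empty_cache() in the real code.
        self.released = True

    def clear_gpu_cache(self, rank=0):
        if self.create_cache_tensor:
            # Branch-specific work (for example waiting on the runner) is omitted.
            pass
        else:
            # Per the review summary, the signal reset stays in this branch only.
            self.cache_ready_signal[rank] = 0

        # Cleanup placed after the if/else, so both branches run it.
        self.gpu_cache_kvs.clear()
        self.gpu_cache_k_tensors.clear()
        self.gpu_cache_v_tensors.clear()
        self._release_device_memory()


for flag in (True, False):
    manager = TransferManagerSketch(create_cache_tensor=flag)
    manager.clear_gpu_cache()
    print(flag, len(manager.gpu_cache_kvs), manager.released, manager.cache_ready_signal)
```

With the cleanup nested inside the if block instead, the run with create_cache_tensor=False would leave the containers populated, which is the leak the PR describes.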
for name, tensor in self.cache_kvs_map.items():
    unset_data_ipc(tensor, name, True, False)
self.cache_ready_signal.value[local_rank] = 0
if not profile:
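🟡 Suggestion: gpu_model_runner.py is on the common GPU path, and architecture.md §A6 notes that changes of this kind usually need to be synced to the other hardware ModelRunners.
Does the original inverted condition if not create_cache_tensor: in clear_cache also exist in xpu_model_runner.py, dcu_model_runner.py, gcu_model_runner.py, hpu_model_runner.py, iluvatar_model_runner.py, and metax_model_runner.py? If so, please fix them in the same PR, so that CI on other hardware does not hit similar CPU cache cleanup errors in the RL scenario.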
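When checking the other runners, it may help to write the wait decision down as a predicate and compare it across all four input combinations. The sketch below assumes the behaviour stated in the PR description (inverted condition before the fix; if create_cache_tensor under an outer if not profile guard after it). The function names `waits_before_fix`, `waits_after_fix`, and `decision_table` are hypothetical and do not exist in FastDeploy.

```python
from itertools import product


def waits_before_fix(create_cache_tensor, profile):
    # Pre-fix behaviour as described in the PR: inverted condition,
    # and no guard on profile (the argument is intentionally unused).
    return not create_cache_tensor


def waits_after_fix(create_cache_tensor, profile):
    # Post-fix behaviour as described: wait only outside profiling,
    # and only when create_cache_tensor is True.
    return (not profile) and create_cache_tensor


def decision_table():
    """Return rows of (create_cache_tensor, profile, before, after)."""
    return [
        (cct, prof, waits_before_fix(cct, prof), waits_after_fix(cct, prof))
        for cct, prof in product([True, False], repeat=2)
    ]


for row in decision_table():
    print(row)
```

A runner whose clear_cache still matches the "before" column for any row would be a candidate for the same fix.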
if hasattr(self, "gpu_cache_scales_v_tensors"):
    self.gpu_cache_scales_v_tensors.clear()
paddle.set_flags({"FLAGS_selected_gpus": f"{self.device}"})
paddle.device.cuda.empty_cache()
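❓ Question: When create_cache_tensor=True:
- _clear_gpu_cache (transfer manager side) waits for cache_ready_signal.value[rank] == 0, logging "Waiting for gpu runner to unlink"
- gpu_model_runner.clear_cache (GPU runner side) likewise waits for cache_ready_signal.value[local_rank] == 0, logging "Waiting for cache transfer manager to unlink"
Both sides claim to be waiting for the other, yet neither sets the signal to 0 in this branch. Please confirm:
- Is the initial value of the signal always 0 when create_cache_tensor=True (that is, the wait is a no-op)?
- If the signal is set to a nonzero value during IPC setup, who is responsible for clearing it? Is there a risk of a deadlock in which both sides wait on each other?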
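The concern can be reproduced in isolation with a small simulation. The sketch below is an assumption-laden model, not FastDeploy code: a plain list stands in for cache_ready_signal, two threads stand in for the transfer manager and the GPU runner, and a timeout is added so that a mutual wait shows up as a failed wait instead of a hang. `wait_for_zero` and `run_both_sides` are hypothetical helpers.

```python
import threading
import time


def wait_for_zero(signal, rank, timeout_s, poll_s=0.01):
    """Poll signal[rank] until it is 0; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while signal[rank] != 0:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


def run_both_sides(initial_value, someone_clears, timeout_s=0.2):
    """Run both waiters on one shared slot and report whether each finished."""
    signal = [initial_value]  # stand-in for cache_ready_signal.value
    results = {}

    def side(name):
        results[name] = wait_for_zero(signal, 0, timeout_s)

    threads = [
        threading.Thread(target=side, args=(name,))
        for name in ("transfer_manager", "gpu_runner")
    ]
    for t in threads:
        t.start()
    if someone_clears:
        signal[0] = 0  # some third party owns the reset
    for t in threads:
        t.join()
    return results


print(run_both_sides(initial_value=0, someone_clears=False))
print(run_both_sides(initial_value=1, someone_clears=False))
print(run_both_sides(initial_value=1, someone_clears=True))
```

In this model the waits only complete if the signal starts at 0 or some party clears it; a nonzero signal with no owner leaves both sides waiting until the timeout. Whether that case can occur in the real code depends on the answers to the two questions above.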
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7764 +/- ##
==========================================
Coverage ? 71.55%
==========================================
Files ? 396
Lines ? 55703
Branches ? 8709
==========================================
Hits ? 39859
Misses ? 13106
Partials ? 2738
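Motivation
There are two bugs in the RL scenario: 1) in cache_transfer_manager._clear_gpu_cache, the GPU cache cleanup (gpu_cache_kvs.clear() and related calls) sits only inside the if branch, so the else branch skips it and the GPU cache is not released correctly; 2) in gpu_model_runner.clear_cache, the create_cache_tensor condition is inverted (if not create_cache_tensor should be if create_cache_tensor), so the synchronization that waits for the cuda IPC unlink takes the wrong branch and CPU cache cleanup fails.
Modifications
- fastdeploy/cache_manager/cache_transfer_manager.py: move the cleanup operations gpu_cache_kvs.clear(), gpu_cache_k_tensors.clear(), gpu_cache_v_tensors.clear(), and paddle.device.cuda.empty_cache() from inside the if block to after the if/else block, so that every branch performs the cleanup
- fastdeploy/worker/gpu_model_runner.py: change if not create_cache_tensor: to if create_cache_tensor:, fixing the condition of the cuda IPC wait logic, and add a wait log for the profile=False case
Usage or Command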
N/A
Accuracy Tests
N/A
Checklist
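- Add at least a tag in the PR title.
  - Tag list: [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run pre-commit before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.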