[Cherry-Pick] [BugFix] [RL] Fix cpu cache for rl (#7764) #7765
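Thanks for your contribution!
The CI report is generated from the following code (updated every 30 minutes):
1 Task overview
⏳ CI is running; some Required tasks have not finished yet (6 running, 1 waiting).
2 Task status summary
2.1 Required tasks: 3/10 passed
2.2 Optional tasks: 22/26 passed
3 Failure details (Required only)
No Required tasks have failed.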
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-09 22:53:21
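📋 Review summary
PR overview: Fixes two bugs in the RL scenario: the GPU cache was not released correctly, and the synchronization for CPU cache cleanup was wrong.
Scope of change: fastdeploy/cache_manager/cache_transfer_manager.py, fastdeploy/worker/gpu_model_runner.py
Impact tags: [RL] [KVCache]
📝 PR convention check
The Cherry-Pick title format is non-compliant (multiple tags, with spaces between the tags); the last Checklist item (confirming the Cherry-Pick submission) is unchecked.
Suggested title (can be copied directly):
[Cherry-Pick][BugFix] Fix cpu cache for rl (#7764)
Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):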
## Motivation
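There are two bugs in the RL scenario: (1) in `cache_transfer_manager._clear_gpu_cache`, the GPU cache cleanup (`gpu_cache_kvs.clear()` and related calls) sits only inside the `if` branch, so the `else` branch never runs it and the GPU cache is not released correctly; (2) in `gpu_model_runner.clear_cache`, the `create_cache_tensor` condition is inverted (`if not create_cache_tensor` should be `if create_cache_tensor`), so the synchronization that waits for the cuda IPC to be unlinked runs on the wrong branch and CPU cache cleanup fails.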
## Modifications
- `fastdeploy/cache_manager/cache_transfer_manager.py`: move the cleanup calls `gpu_cache_kvs.clear()`, `gpu_cache_k_tensors.clear()`, `gpu_cache_v_tensors.clear()`, and `paddle.device.cuda.empty_cache()` from inside the `if` block to after the `if/else` block so that every branch runs them; add `paddle.set_flags({"FLAGS_selected_gpus": ...})` so the cache is emptied on the correct device
- `fastdeploy/worker/gpu_model_runner.py`: change `if not create_cache_tensor:` to `if create_cache_tensor:`, fixing the wrong condition in the cuda IPC wait logic, and add a wait log for the `profile=False` case
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 📝 PR convention | - | Cherry-Pick title format is non-compliant (multiple tags, with spaces between the tags); the last Checklist item is unchecked |
| ❓ Question | fastdeploy/cache_manager/cache_transfer_manager.py:764 | paddle.set_flags is a process-wide global operation; is it thread-safe under concurrent multi-GPU use? |
| ❓ Question | fastdeploy/worker/gpu_model_runner.py:2875 | With create_cache_tensor=True, both sides wait for cache_ready_signal==0 but neither sets it; please confirm there is no deadlock risk |
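Overall assessment
This PR correctly fixes two bugs in the RL scenario: a GPU cache cleanup path that was skipped (the else branch did no cleanup) and an inverted condition in the IPC synchronization. The main logic is headed in the right direction. The author should clarify the full coordination mechanism of cache_ready_signal when create_cache_tensor=True, and fix the Cherry-Pick title format.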
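The first bug summarized above (cleanup reachable from only one branch) can be illustrated with a minimal sketch. It is not the FastDeploy code: `CacheHolder`, `clear_before_fix`, `clear_after_fix`, and the `flag` parameter are hypothetical names, and the containers hold placeholder objects instead of GPU tensors.

```python
class CacheHolder:
    """Hypothetical stand-in for the object that owns the GPU cache containers."""

    def __init__(self):
        self.gpu_cache_kvs = {"k_0": object(), "v_0": object()}
        self.gpu_cache_k_tensors = [object()]
        self.gpu_cache_v_tensors = [object()]

    def _release(self):
        # Drop every reference so the underlying memory can be freed.
        self.gpu_cache_kvs.clear()
        self.gpu_cache_k_tensors.clear()
        self.gpu_cache_v_tensors.clear()

    def clear_before_fix(self, flag):
        if flag:
            self._release()  # cleanup reachable from this branch only
        else:
            pass  # branch-specific work, but no cleanup

    def clear_after_fix(self, flag):
        if flag:
            pass  # branch-specific work
        else:
            pass  # branch-specific work
        self._release()  # runs regardless of which branch was taken


holder = CacheHolder()
holder.clear_before_fix(False)
leaked = len(holder.gpu_cache_kvs)     # entries survive the else path
holder.clear_after_fix(False)
remaining = len(holder.gpu_cache_kvs)  # nothing is left after the fix
```

Placing the shared cleanup after the `if/else` keeps the branch-specific work separate from the work every path must do, which is the structure the PR description reports adopting.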
    self.gpu_cache_scales_k_tensors.clear()
if hasattr(self, "gpu_cache_scales_v_tensors"):
    self.gpu_cache_scales_v_tensors.clear()
paddle.set_flags({"FLAGS_selected_gpus": f"{self.device}"})
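❓ Question: paddle.set_flags({"FLAGS_selected_gpus": f"{self.device}"}) modifies a process-wide global flag. In multi-GPU (TP > 1) or multi-threaded scenarios, this call may race with the GPU context of other threads or processes.
Please confirm:
- What is the purpose of calling set_flags here (is it meant to fix a specific case where empty_cache picked the wrong device)?
- In tensor-parallel scenarios, is there a concurrency-safety problem when multiple ranks run _clear_gpu_cache at the same time?
If the goal is only to make empty_cache act on the correct device, consider using paddle.device.set_device(f"gpu:{self.device}") or an explicit with paddle.CUDAPlace(self.device): context to avoid the side effects of a global flag.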
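One way to read the reviewer's suggestion is a scoped device switch: select the target device, run the cleanup, then restore whatever was selected before. The sketch below assumes that this is the only goal. The helper name `on_device` is hypothetical, and the getter and setter are passed in as arguments so the example runs without a GPU; with Paddle installed, `paddle.device.get_device` and `paddle.device.set_device` would be the natural candidates. Because `set_device` also changes the current device for later operations, the sketch restores the previous value on exit.

```python
from contextlib import contextmanager


@contextmanager
def on_device(target, get_device, set_device):
    """Switch to `target`, then restore the previous device on exit."""
    previous = get_device()
    set_device(target)
    try:
        yield
    finally:
        set_device(previous)


# Stand-ins so the sketch runs without a GPU (assumption: real code would pass
# paddle.device.get_device and paddle.device.set_device instead).
state = {"device": "gpu:0"}
calls = []


def fake_get_device():
    return state["device"]


def fake_set_device(name):
    state["device"] = name
    calls.append(name)


with on_device("gpu:3", fake_get_device, fake_set_device):
    device_inside = state["device"]  # this is where empty_cache() would run
```

This does not by itself answer the thread-safety question raised above; it only limits how long the changed selection stays in effect.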
if create_cache_tensor:
    if self.fd_config.cache_config.num_cpu_blocks > 0:
        logger.info("Waiting for cache transfer manager to unlink cuda ipc")
        while self.cache_ready_signal.value[local_rank] != 0:
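❓ Question: please confirm whether the cache_ready_signal synchronization protocol between the two sides can deadlock when create_cache_tensor=True.
Current logic:
- cache_transfer_manager._clear_gpu_cache (create_cache_tensor=True): waits for cache_ready_signal[rank] == 0 (comment: "Waiting for gpu runner to unlink cuda ipc")
- gpu_model_runner.clear_cache (create_cache_tensor=True): also waits for cache_ready_signal[local_rank] == 0 (comment: "Waiting for cache transfer manager to unlink cuda ipc")
Both sides wait for the other to set the signal to 0, but on the create_cache_tensor=True code path neither side actively sets it to 0 (signal=0 is set only in the if not create_cache_tensor: branch). Please confirm the initial value of the signal and who is responsible for setting it to 0 in this scenario, to avoid a potential deadlock.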
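The concern above is a busy-wait with no exit if nobody clears the flag. A minimal sketch of a bounded wait, which turns a silent hang into a detectable failure, is shown below. It is illustrative only: `wait_until_zero`, `timeout_s`, and `poll_s` are hypothetical names, and a plain list stands in for the shared signal array.

```python
import time


def wait_until_zero(signal, rank, timeout_s=1.0, poll_s=0.01):
    """Poll signal[rank] until it is 0; return False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while signal[rank] != 0:
        if time.monotonic() >= deadline:
            return False  # nobody cleared the flag in time
        time.sleep(poll_s)
    return True


# Case 1: neither side clears the flag, so the wait gives up at the deadline.
stuck_signal = [1]
timed_out = not wait_until_zero(stuck_signal, 0, timeout_s=0.05)

# Case 2: the flag is already 0, so the wait returns immediately.
ready_signal = [0]
returned_ok = wait_until_zero(ready_signal, 0, timeout_s=0.05)
```

A caller could log the rank and the current signal value when the function returns `False`, which would make it easier to tell which side was expected to clear the flag.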
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.6 #7765 +/- ##
==============================================
Coverage ? 71.83%
==============================================
Files ? 378
Lines ? 53931
Branches ? 8434
==============================================
Hits ? 38742
Misses ? 12425
Partials ? 2764
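The figures in the table are internally consistent, which a short check makes visible. The sketch assumes the coverage percentage is hits divided by total lines; under that assumption the reported 71.83% matches the ratio truncated, not rounded, to two decimals.

```python
import math

# Values copied from the coverage table above.
hits, misses, partials, lines = 38742, 12425, 2764, 53931

# Every line is counted as exactly one of hit, miss, or partial.
counts_add_up = (hits + misses + partials == lines)

ratio_percent = hits / lines * 100                   # about 71.836
truncated = math.floor(ratio_percent * 100) / 100    # 71.83
rounded = round(ratio_percent, 2)                    # 71.84
```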
228987a into PaddlePaddle:release/2.6
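Motivation
There are two bugs in the RL scenario: (1) in cache_transfer_manager._clear_gpu_cache, the GPU cache cleanup (gpu_cache_kvs.clear() and related calls) sits only inside the if branch, so the else branch never runs it and the GPU cache is not released correctly; (2) in gpu_model_runner.clear_cache, the create_cache_tensor condition is inverted (if not create_cache_tensor should be if create_cache_tensor), so the synchronization that waits for the cuda IPC to be unlinked runs on the wrong branch and CPU cache cleanup fails.
Modifications
- fastdeploy/cache_manager/cache_transfer_manager.py: move the cleanup calls gpu_cache_kvs.clear(), gpu_cache_k_tensors.clear(), gpu_cache_v_tensors.clear(), and paddle.device.cuda.empty_cache() from inside the if block to after the if/else block so that every branch runs them
- fastdeploy/worker/gpu_model_runner.py: change if not create_cache_tensor: to if create_cache_tensor:, fixing the wrong condition in the cuda IPC wait logic, and add a wait log for the profile=False case
Usage or Command
N/A
Accuracy Tests
N/A
Checklist
- Add at least a tag in the PR title.
  - Tag list: [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run pre-commit before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.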