[FDConfig] Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default #7746

Open
sunlei1024 wants to merge 9 commits into PaddlePaddle:develop from
sunlei1024:feat/default-enable-shm

Conversation

@sunlei1024
Collaborator

Motivation

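Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default, to make Engine-to-Worker tensor transfer more efficient and to improve the communication performance of the shared-memory (SHM) based engine task queue.

In large-model inference, this reduces serialization/deserialization overhead and improves throughput and latency.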

Modifications

  • fastdeploy/envs.py
    • FD_ENABLE_E2W_TENSOR_CONVERT: default changed from 0 to 1
    • FD_ENGINE_TASK_QUEUE_WITH_SHM: default changed from 0 to 1

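Behavior changes

  • When the environment variables are not set explicitly, the optimizations above are now enabled by default
  • To keep the old behavior, set them manually:
    • FD_ENABLE_E2W_TENSOR_CONVERT=0
    • FD_ENGINE_TASK_QUEUE_WITH_SHM=0
  • In container environments, make sure /dev/shm has enough space (≥ 1 GB recommended, depending on model size)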
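
The default-on behavior can be illustrated with a minimal sketch. It assumes the int(os.getenv(NAME, DEFAULT)) pattern that fastdeploy/envs.py uses for these two variables; the `read_flag` helper and its `environ` parameter are hypothetical names introduced only for this example.

```python
import os


def read_flag(name, default="1", environ=None):
    """Return an integer flag from the environment, falling back to `default`.

    `read_flag` is a hypothetical helper. It mirrors the
    int(os.getenv(NAME, DEFAULT)) pattern, not FastDeploy's actual code.
    """
    env = os.environ if environ is None else environ
    return int(env.get(name, default))


# Variable not set: the new default (1) applies.
print(read_flag("FD_ENGINE_TASK_QUEUE_WITH_SHM", environ={}))

# Variable explicitly set to 0: the old behavior is kept.
print(read_flag("FD_ENGINE_TASK_QUEUE_WITH_SHM",
                environ={"FD_ENGINE_TASK_QUEUE_WITH_SHM": "0"}))
```

Passing a dictionary as `environ` keeps the sketch independent of the shell it runs in; real code would read os.environ directly.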

Usage or Command

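No extra configuration is needed by default; the change takes effect automatically after upgrading.

To disable these features, use the environment variables: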

export FD_ENABLE_E2W_TENSOR_CONVERT=0
export FD_ENGINE_TASK_QUEUE_WITH_SHM=0

Docker example (configuring shared memory):

docker run --shm-size=1g ...

Accuracy Tests

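This change only adjusts environment variable defaults and does not alter any model computation logic.

Verification results:

  • ✅ Functional verification: service startup and the inference flow work normally
  • ✅ Performance verification: E2W tensor convert and the SHM queue work normally
  • ✅ Consistency verification: with the switches turned off (set to 0), results match the previous version (no accuracy difference)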

Checklist

  • Add at least one tag in PR title (e.g., [FDConfig])
  • Code formatted and pre-commit passed
  • Unit tests added
    • Reason: this change only modifies default configuration and adds no new logic paths
  • Accuracy results provided
  • Backward compatibility considered (can be reverted via environment variables)
  • Not a Cherry-Pick PR / OR follows Cherry-Pick rules if applicable

@paddle-bot

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@sunlei1024 sunlei1024 changed the title from "[test] Stop server with /dev/shm cleanup" to "[FDConfig] Enable FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM by default" on May 7, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 7, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-09 23:38:26

This CI report was generated from the following code (refreshed every 30 minutes):


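1 Task overview

1 Required task has failed (Approval: the PR is not yet approved) and 3 Required tasks are still running. Wait for all Required tasks to finish before merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 36 (0) | 36 | 30 | 1 | 3 | 2 | 0 |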

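2 Task status summary

2.1 Required tasks: 6/10 passed

Required tasks block merging; handle their failures first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| Failed | Approval | 9s | Environment issue: the PR lacks the required approval, exit code 6 | Ask the relevant reviewers to approve this PR | Job | - |
| Running | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| Running | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | Job | - |
| Running | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| Passed | The remaining 6 required tasks | - | - | - | - | - |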

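2.2 Optional tasks: 24/26 passed

Optional tasks do not block merging; their failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ⏸️ | CI_HPU | - | - | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| Passed | The remaining 24 optional tasks | - | - | - |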

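3 Failure details (required only)

Approval: infrastructure (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: infrastructure
  • Confidence: high
  • Root cause summary: the PR has not received the required approval; the Approval check exited with exit code 6
  • Analyzer: generic analysis (fallback)

Root cause details:
The Approval workflow is the GitHub approval check. It failed after only 9 seconds, which indicates that the PR has not yet been approved by the required reviewers (exit code 6 usually means "approval conditions not met"). The failure is unrelated to the code changes in this PR; it is a process issue and requires no code change.

Key log:

Process completed with exit code 6.
(.github line 28)

Suggested fix:

  1. Ask the relevant reviewers to review and approve the PR
  2. Once all required reviewers have approved, the Approval check reruns automatically and passes

Fix summary: ask a reviewer to approve this PR

Link: view log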

@codecov-commenter

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 8.33333% with 11 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@e67ed1a). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...stdeploy/inter_communicator/engine_worker_queue.py | 11.11% | 8 Missing ⚠️ |
| fastdeploy/engine/common_engine.py | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7746   +/-   ##
==========================================
  Coverage           ?   72.41%           
==========================================
  Files              ?      396           
  Lines              ?    55708           
  Branches           ?     8706           
==========================================
  Hits               ?    40341           
  Misses             ?    12606           
  Partials           ?     2761           
Flag Coverage Δ
GPU 72.41% <8.33%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


@sunlei1024 sunlei1024 deployed to Metax_ci May 9, 2026 15:10 — with GitHub Actions Active

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-09 23:20:19

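📋 Review summary

PR overview: changes the default values of the two environment variables FD_ENABLE_E2W_TENSOR_CONVERT and FD_ENGINE_TASK_QUEUE_WITH_SHM from 0 to 1, adds a queue health-check method to EngineWorkerQueue, and refactors test utility functions
Scope of changes: fastdeploy/envs.py, fastdeploy/engine/common_engine.py, fastdeploy/inter_communicator/engine_worker_queue.py, and test files
Impact tags: [FDConfig] [Engine]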

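📝 PR convention check

The PR title uses the official tag [FDConfig] and is correctly formatted. The five required sections (## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests, ## Checklist) are all filled in, and the structure follows the template. ✓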

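Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | tests/ci_use/EB_Lite_with_adapter/test_eblite_serving.py:93 | rm -rf /dev/shm/* wipes /dev/shm globally; in a shared CI environment it may delete other processes' socket files |
| ❓ Question | fastdeploy/inter_communicator/engine_worker_queue.py:859 | The except Exception: return False branch in is_broken() logs nothing, so unknown exceptions are silently swallowed |
| ❓ Question | fastdeploy/envs.py:178 | With FD_ENGINE_TASK_QUEUE_WITH_SHM=1 on by default, there is no capacity pre-check or clear error at startup when /dev/shm is too small (Docker defaults to only 64 MB) |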

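Overall assessment

The change is logically clear, its performance intent is explicit, and the code quality is good. Two details deserve attention: the global /dev/shm cleanup in the test is too broad, and is_broken() handles exceptions silently. It is also worth stating the SHM capacity requirement in the documentation or the startup log.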

"""
# Clean up temporary files in /dev/shm
try:
    subprocess.run("rm -rf /dev/shm/*", shell=True)

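🟡 Suggestion: rm -rf /dev/shm/* removes every file under /dev/shm. With concurrent tests or a shared CI environment, it may delete other processes' unix socket files.

Consider cleaning up only the socket files that belong to this test: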

import glob
import os

# Remove only this test's task-queue sockets; leave other /dev/shm entries alone
for sock in glob.glob("/dev/shm/fd_task_queue_*.sock"):
    try:
        os.remove(sock)
    except OSError:
        pass

Alternatively, reuse the cleanup_unix_socket function introduced in this PR and clean up the corresponding socket for each entry in PORTS_TO_CLEAN.
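
The targeted cleanup suggested above can be sketched as a small reusable helper. This is only an illustration: `remove_matching_files` is a hypothetical name, not the PR's cleanup_unix_socket, and the demo runs against a temporary directory instead of the real /dev/shm.

```python
import glob
import os
import tempfile


def remove_matching_files(directory, pattern):
    """Delete only the files in `directory` whose names match `pattern`.

    Hypothetical helper for illustration. Returns the sorted removed paths.
    """
    removed = []
    for path in glob.glob(os.path.join(directory, pattern)):
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            # The file vanished or cannot be deleted by us; skip it.
            pass
    return sorted(removed)


# Demonstrate on a throwaway directory rather than the real /dev/shm.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("fd_task_queue_8000.sock", "other_process.sock"):
        open(os.path.join(tmp, name), "w").close()
    remove_matching_files(tmp, "fd_task_queue_*.sock")
    print(sorted(os.listdir(tmp)))  # only other_process.sock remains
```

Scoping the deletion to a name pattern is what keeps unrelated files safe; a shell-level rm -rf over the whole directory has no such filter.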

except (ConnectionRefusedError, ConnectionResetError, BrokenPipeError, EOFError, OSError):
    llm_logger.error("Failed to connect to engine worker queue")
    return True
except Exception:

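❓ Question: the except Exception: return False branch produces no log output. On an unexpected exception (such as AuthenticationError or MemoryError) it silently returns False (meaning the queue is healthy), which may mask a real failure.

Consider adding a log line: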

except Exception as e:
    llm_logger.warning("Unexpected exception in is_broken check: %s", e)
    return False
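
To show the effect of the extra log line in context, here is a self-contained sketch of a health check with the same two except branches. It is an assumption-laden illustration, not the PR's is_broken(): the `probe` callable stands in for whatever call actually touches the queue, and a standard-library logger replaces llm_logger.

```python
import logging

logger = logging.getLogger("queue_health")


def is_broken(probe):
    """Report whether a queue looks broken, based on a `probe` callable.

    `probe` is a hypothetical stand-in for the real queue access.
    """
    try:
        probe()
        return False
    except (ConnectionRefusedError, ConnectionResetError, BrokenPipeError, EOFError, OSError):
        logger.error("Failed to connect to engine worker queue")
        return True
    except Exception as exc:
        # Unknown failure: keep the "not broken" verdict, but leave a trace.
        logger.warning("Unexpected exception in is_broken check: %s", exc)
        return False


def refuse():
    raise ConnectionRefusedError("no listener")


def surprise():
    raise ValueError("unexpected")


print(is_broken(lambda: None), is_broken(refuse), is_broken(surprise))
```

The third call still returns False, but the warning now records why, so an operator can tell a healthy queue from an unclassified error.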

Comment thread fastdeploy/envs.py
- "FD_ENABLE_E2W_TENSOR_CONVERT": lambda: int(os.getenv("FD_ENABLE_E2W_TENSOR_CONVERT", "0")),
- "FD_ENGINE_TASK_QUEUE_WITH_SHM": lambda: int(os.getenv("FD_ENGINE_TASK_QUEUE_WITH_SHM", "0")),
+ "FD_ENABLE_E2W_TENSOR_CONVERT": lambda: int(os.getenv("FD_ENABLE_E2W_TENSOR_CONVERT", "1")),
+ "FD_ENGINE_TASK_QUEUE_WITH_SHM": lambda: int(os.getenv("FD_ENGINE_TASK_QUEUE_WITH_SHM", "1")),

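❓ Question: with the default of FD_ENGINE_TASK_QUEUE_WITH_SHM changed to 1, a Docker container's default /dev/shm is only 64 MB. In large-model scenarios this may cause socket communication to fail with an unintuitive error message.

Consider adding a capacity pre-check (or at least a warning) during EngineWorkerQueue.__init__: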

import logging
import shutil

logger = logging.getLogger(__name__)

# Warn when /dev/shm has less than 1 GB free
shm_free = shutil.disk_usage("/dev/shm").free
if shm_free < 1 * 1024 ** 3:  # 1 GB
    logger.warning("/dev/shm free space %d MB < 1 GB, consider --shm-size=1g", shm_free // 1024**2)
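
The same pre-check can be written as a function that is easy to test on any path. This is a sketch under assumptions: `free_space_shortfall` is a hypothetical name, the 1 GiB threshold follows the PR's recommendation, and the demo uses the system temp directory so it also runs where /dev/shm does not exist.

```python
import shutil
import tempfile


def free_space_shortfall(path, required_bytes):
    """Return how many bytes `path` is short of `required_bytes` (0 if enough)."""
    free = shutil.disk_usage(path).free
    return max(0, required_bytes - free)


ONE_GIB = 1024 ** 3

# A real deployment would pass "/dev/shm" here.
shortfall = free_space_shortfall(tempfile.gettempdir(), ONE_GIB)
if shortfall:
    print(f"short by {shortfall // 1024 ** 2} MB; consider a larger --shm-size")
else:
    print("enough free space")
```

Returning the shortfall instead of a boolean lets the caller decide whether to warn or to fail fast, and it gives the log message a concrete number.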

Labels

None yet

Projects

None yet

3 participants