15 changes: 8 additions & 7 deletions gpustack_runtime/detector/iluvatar.py
@@ -137,15 +137,16 @@ def detect(self) -> Devices | None:
 dev_mem = 0
 dev_mem_used = 0
 dev_mem_status = DeviceMemoryStatusEnum.HEALTHY
+with contextlib.suppress(pyixml.NVMLError):
+    dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
+    dev_mem = byte_to_mebibyte(  # byte to MiB
+        dev_mem_info.total,
+    )
+    dev_mem_used = byte_to_mebibyte(  # byte to MiB
+        dev_mem_info.used,
+    )
Comment on lines +140 to +147
medium

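When the memory query fails, `contextlib.suppress` silently swallows the error. The Web UI then shows 0 for device memory and nothing is logged, which makes the problem hard to troubleshoot.

I suggest using `try...except pyixml.NVMLError` and calling `debug_log_exception` in the `except` block to record a debug log. This matches how the codebase handles similar failures elsewhere (for example, when getting the topology distance between devices fails) and makes later maintenance and troubleshooting easier.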

Suggested change
-with contextlib.suppress(pyixml.NVMLError):
-    dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
-    dev_mem = byte_to_mebibyte(  # byte to MiB
-        dev_mem_info.total,
-    )
-    dev_mem_used = byte_to_mebibyte(  # byte to MiB
-        dev_mem_info.used,
-    )
+try:
+    dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
+    dev_mem = byte_to_mebibyte(  # byte to MiB
+        dev_mem_info.total,
+    )
+    dev_mem_used = byte_to_mebibyte(  # byte to MiB
+        dev_mem_info.used,
+    )
+except pyixml.NVMLError:
+    debug_log_exception(
+        logger,
+        "Failed to get memory info for device %d",
+        dev_index,
+    )
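
To illustrate the difference between the two error-handling styles discussed in this comment, here is a minimal, self-contained sketch. It does not use `pyixml` or the project's `debug_log_exception` helper: `FakeNVMLError`, `failing_query`, `read_memory_suppressed`, and `read_memory_logged` are hypothetical names made up for this example, and the standard-library `logger.debug(..., exc_info=True)` call is only assumed to be a rough equivalent of what the project helper does.

```python
import contextlib
import logging

logger = logging.getLogger("detector_sketch")


class FakeNVMLError(Exception):
    """Hypothetical stand-in for pyixml.NVMLError."""


def failing_query(dev_index: int) -> tuple[int, int]:
    """Hypothetical memory query that always fails, to exercise both paths."""
    raise FakeNVMLError(f"device {dev_index} not responding")


def read_memory_suppressed(dev_index: int) -> tuple[int, int]:
    """Style under review: the failure is swallowed and leaves no trace."""
    dev_mem, dev_mem_used = 0, 0
    with contextlib.suppress(FakeNVMLError):
        dev_mem, dev_mem_used = failing_query(dev_index)
    return dev_mem, dev_mem_used


def read_memory_logged(dev_index: int) -> tuple[int, int]:
    """Suggested style: same fallback values, but the failure is logged."""
    dev_mem, dev_mem_used = 0, 0
    try:
        dev_mem, dev_mem_used = failing_query(dev_index)
    except FakeNVMLError:
        # Assumption: stand-in for the project's debug_log_exception helper.
        logger.debug(
            "Failed to get memory info for device %d",
            dev_index,
            exc_info=True,
        )
    return dev_mem, dev_mem_used
```

Both functions return the same `(0, 0)` fallback on failure, so callers behave identically; the only difference is that the second leaves a debug record, including the traceback, for whoever investigates a device reporting 0 MiB.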

Author


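The original code did not handle this case that way either, so I would like the reviewers' opinion here. I am happy to change it whichever way is preferred.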

 if not envs.GPUSTACK_RUNTIME_DETECT_NO_HEALTH_CHECK:
     with contextlib.suppress(pyixml.NVMLError):
-        dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
-        dev_mem = byte_to_mebibyte(  # byte to MiB
-            dev_mem_info.total,
-        )
-        dev_mem_used = byte_to_mebibyte(  # byte to MiB
-            dev_mem_info.used,
-        )
         dev_health = pyixml.ixmlDeviceGetHealth(dev)
         if dev_health != pyixml.IXML_HEALTH_OK:
             dev_mem_status = DeviceMemoryStatusEnum.UNHEALTHY