fix(iluvatar): move memory query out of health check condition #13
stezpy wants to merge 1 commit into
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Code Review
This pull request modifies the Iluvatar GPU detector to retrieve device memory information regardless of whether the health check is enabled. The reviewer suggested replacing the use of contextlib.suppress with a try...except block that logs a debug exception if retrieving the memory info fails, preventing silent failures and improving troubleshooting.
    with contextlib.suppress(pyixml.NVMLError):
        dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
        dev_mem = byte_to_mebibyte(  # byte to MiB
            dev_mem_info.total,
        )
        dev_mem_used = byte_to_mebibyte(  # byte to MiB
            dev_mem_info.used,
        )
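Using contextlib.suppress silently ignores the error when the memory query fails. In that case the Web UI shows 0 for device memory and nothing is logged, which makes the problem hard to troubleshoot.
I suggest using try...except pyixml.NVMLError and calling debug_log_exception in the except block to record a debug log. This matches how failures are handled elsewhere in the codebase (for example, when getting the topology distance between devices fails) and helps later maintenance and troubleshooting.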
Suggested change:

    - with contextlib.suppress(pyixml.NVMLError):
    -     dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
    -     dev_mem = byte_to_mebibyte(  # byte to MiB
    -         dev_mem_info.total,
    -     )
    -     dev_mem_used = byte_to_mebibyte(  # byte to MiB
    -         dev_mem_info.used,
    -     )
    + try:
    +     dev_mem_info = pyixml.nvmlDeviceGetMemoryInfo(dev)
    +     dev_mem = byte_to_mebibyte(  # byte to MiB
    +         dev_mem_info.total,
    +     )
    +     dev_mem_used = byte_to_mebibyte(  # byte to MiB
    +         dev_mem_info.used,
    +     )
    + except pyixml.NVMLError:
    +     debug_log_exception(
    +         logger,
    +         "Failed to get memory info for device %d",
    +         dev_index,
    +     )
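To illustrate the difference the review points at, here is a minimal, self-contained sketch. It does not use pyixml: `FakeNVMLError` and `failing_memory_query` are hypothetical stand-ins, and `logger.debug(..., exc_info=True)` is only an assumed approximation of what `debug_log_exception` might do, since that helper's implementation is not shown in this thread.

```python
import contextlib
import logging

logger = logging.getLogger("detector_sketch")


class FakeNVMLError(Exception):
    """Hypothetical stand-in for pyixml.NVMLError."""


def failing_memory_query(dev_index):
    """Hypothetical stand-in for a memory query that always fails."""
    raise FakeNVMLError(f"device {dev_index} unavailable")


def read_total_suppressed(dev_index):
    """Swallow the error: the caller sees 0 and no log record exists."""
    total = 0
    with contextlib.suppress(FakeNVMLError):
        total = failing_memory_query(dev_index)
    return total


def read_total_logged(dev_index):
    """Keep the same fallback value, but leave a debug trace behind."""
    total = 0
    try:
        total = failing_memory_query(dev_index)
    except FakeNVMLError:
        # Assumption: the project's debug_log_exception behaves roughly
        # like a debug-level log call that attaches the traceback.
        logger.debug(
            "Failed to get memory info for device %d",
            dev_index,
            exc_info=True,
        )
    return total
```

Both functions return 0 on failure, so the value reported to the UI is unchanged. Only the second leaves a record that can be found when debug logging is enabled.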
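The original code did not handle this case that way either, so I would like the reviewers' opinion on it. I am happy to make whatever changes are needed.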
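On version 2.1.2, we found that newly added worker nodes could not read Iluvatar (天数智芯) GPU information correctly.
This is related to the memory detection in iluvatar.py in the runtime: under the default conditions the memory detection is skipped, so no GPU information is shown in the Web UI.
This PR modifies the GPU memory detection flow.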
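The control-flow change described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code in iluvatar.py: the `health_check_enabled` flag, the `FakeMemoryInfo` type, the `query_memory` callable, and both function names are hypothetical, and `byte_to_mebibyte` is re-implemented here as a stand-in for the helper seen in the diff.

```python
from dataclasses import dataclass


@dataclass
class FakeMemoryInfo:
    """Hypothetical stand-in for the object returned by the memory query."""
    total: int
    used: int


def byte_to_mebibyte(value):
    """Convert bytes to MiB (stand-in for the helper seen in the diff)."""
    return value // (1024 * 1024)


def detect_before(query_memory, health_check_enabled):
    """Assumed old flow: memory is only queried when the health check runs."""
    dev_mem = dev_mem_used = 0
    if health_check_enabled:
        # ... health check would run here ...
        info = query_memory()
        dev_mem = byte_to_mebibyte(info.total)
        dev_mem_used = byte_to_mebibyte(info.used)
    return dev_mem, dev_mem_used


def detect_after(query_memory, health_check_enabled):
    """Assumed new flow: memory is queried whether or not the check runs."""
    dev_mem = dev_mem_used = 0
    if health_check_enabled:
        pass  # ... health check only ...
    info = query_memory()
    dev_mem = byte_to_mebibyte(info.total)
    dev_mem_used = byte_to_mebibyte(info.used)
    return dev_mem, dev_mem_used
```

With the health check disabled, `detect_before` leaves both values at 0, which corresponds to the empty GPU information reported in the Web UI, while `detect_after` still reports the device memory.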