[PyTorch] CPU overhead optimizations for te autocast #2957
Conversation
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Greptile Summary: This PR reduces CPU overhead in the TE autocast path on Grace/GB200 systems via two optimizations: caching the string representation of recipe objects to avoid repeated formatting work, and converting the autocast context manager into a class with explicit __enter__/__exit__ methods.
Confidence Score: 5/5. Safe to merge; all changed code paths are well-guarded, and the optimizations are straightforward caching and class-conversion changes with no behavioural regressions. The caching logic is correct; no files require special attention.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["fp8_autocast(...)"] -->|"returns autocast instance"| B["autocast.__init__"]
    B --> C["with autocast(...)"]
    C --> D["autocast.__enter__"]
    D --> E{"_fp8_state is not None?"}
    E -->|"Yes"| F["raise RuntimeError\n(concurrent reuse guard)"]
    E -->|"No"| G["check_recipe_support"]
    G --> H["get_autocast_state → _fp8_state"]
    H --> I["autocast_enter\n(set FP8 global state)"]
    I --> J["yield control to user code"]
    J --> K["autocast.__exit__"]
    K --> L["set_autocast_state(_fp8_state)"]
    L --> M["autocast_exit"]
    M --> N["_fp8_state = None\n(allows sequential reuse)"]
```
/te-ci pytorch
timmoon10 left a comment:

Overall LGTM, with one simplification.
```python
recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
if recipe_repr is None:
    recipe_repr = str(recipe)
group_id = id(group) if group is not None else None
return f"recipe={recipe_repr},group={group_id}"
```
We can simplify now that the recipe handles caching internally:

```diff
- recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
- if recipe_repr is None:
-     recipe_repr = str(recipe)
- group_id = id(group) if group is not None else None
- return f"recipe={recipe_repr},group={group_id}"
+ group_id = id(group) if group is not None else None
+ return f"recipe=({recipe}),group={group_id}"
```
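The "recipe handles caching internally" point can be illustrated with a sketch, assuming a hypothetical Recipe base class (the real TE recipe classes differ): the first `str()` call formats the instance and stores the result in `_cached_repr`, so later calls just return the cached string.

```python
class Recipe:
    """Hypothetical recipe base class that caches its string form."""

    def __str__(self):
        cached = self.__dict__.get("_cached_repr")
        if cached is None:
            # Format once from the instance attributes, then cache.
            fields = ",".join(f"{k}={v}" for k, v in sorted(self.__dict__.items()))
            cached = f"{type(self).__name__}({fields})"
            self.__dict__["_cached_repr"] = cached
        return cached
```

With the cache living inside the recipe, callers can simply interpolate `{recipe}` into the key, as the suggestion above does.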
Description
te-autocast has quite a bit of CPU overhead on Grace systems. Here are the results on GB200 after the optimizations.
Without Optimizations

Optimization 1: cache the recipe's string representation when building the unique autocast key. Directly accessing the cached representation is ~40 ns faster than going through str(recipe).
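The key-building path can be sketched as follows. The Recipe class and autocast_key helper here are hypothetical stand-ins for the TE code; the point is reading `_cached_repr` straight out of `__dict__`, which skips the `str()` call and its `__str__` dispatch (the ~40 ns saving measured above), while falling back to `str(recipe)` when no cache exists.

```python
class Recipe:
    """Hypothetical recipe whose string form is computed once at construction."""

    def __init__(self, margin=0, fp8_format="E4M3"):
        self.margin = margin
        self.fp8_format = fp8_format
        # Precompute the string form; __str__ just returns the cache.
        self._cached_repr = f"margin={margin},format={fp8_format}"

    def __str__(self):
        return self._cached_repr


def autocast_key(recipe, group=None):
    # Direct __dict__ lookup avoids the str() round-trip on the hot path.
    recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
    if recipe_repr is None:
        recipe_repr = str(recipe)  # fallback for uncached recipes / None
    group_id = id(group) if group is not None else None
    return f"recipe={recipe_repr},group={group_id}"
```

Since the key is rebuilt on every autocast entry, shaving the formatting work per call is what reduces the per-iteration CPU overhead.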
Type of change
Changes
Checklist: