
[PyTorch] CPU overhead optimizations for te autocast#2957

Open
vthumbe1503 wants to merge 6 commits into NVIDIA:main from vthumbe1503:cpu_opt_te_autocast

Conversation

@vthumbe1503
Collaborator

@vthumbe1503 vthumbe1503 commented May 4, 2026

Description

te autocast has quite a bit of CPU overhead on Grace systems.
Here are the results on GB200 after the optimizations:

  • Without optimizations
    [profiler screenshot]

  • Optimization 1: Cache the recipe's string representation for computing the unique autocast key. Directly accessing the cached representation of the recipe is also ~40 ns faster than going through str(recipe).
    [profiler screenshot]
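The cached-repr idea can be sketched as follows. This is an illustrative reimplementation, not the actual TE code; the class names mirror the PR but the bodies are made up. The key points are that any attribute write invalidates the cache via __setattr__, and __repr__ rebuilds it lazily through a subclass hook (called _make_repr here, as in the PR).

```python
class Recipe:
    """Illustrative base class with a lazily cached string representation."""

    def __repr__(self):
        # Rebuild the repr only if the cache was invalidated (or never built).
        cached = self.__dict__.get("_cached_repr")
        if cached is None:
            cached = self._make_repr()
            self.__dict__["_cached_repr"] = cached
        return cached

    def __setattr__(self, name, value):
        # Any attribute mutation invalidates the cached representation.
        super().__setattr__(name, value)
        if name != "_cached_repr":
            self.__dict__["_cached_repr"] = None

    def _make_repr(self):
        # Subclasses build their actual repr string here.
        raise NotImplementedError


class DelayedScaling(Recipe):
    """Hypothetical subclass; field name is illustrative only."""

    def __init__(self, margin=0):
        self.margin = margin

    def _make_repr(self):
        return f"DelayedScaling(margin={self.margin})"
```

Repeated calls to repr() then return the cached string directly, and a dictionary lookup on recipe.__dict__ can skip even the __repr__ call when the cache is warm.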

  • Optimization 2: Use explicit __enter__/__exit__ methods instead of contextlib.contextmanager.
    [profiler screenshot]
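A minimal sketch of the second optimization, under the assumption (matching the review summary) that the class-based version carries a concurrent-reuse guard. Every `with` over a @contextmanager-decorated function allocates a generator plus a _GeneratorContextManager wrapper; a plain class with __slots__ and explicit __enter__/__exit__ avoids that machinery. The state strings here are placeholders for the real FP8 global state.

```python
import contextlib


# Generator-based version: each `with` builds a generator and a wrapper object.
@contextlib.contextmanager
def autocast_gen():
    state = "saved FP8 state"  # placeholder for saving global state
    try:
        yield
    finally:
        state = None  # placeholder for restoring global state


# Class-based version: no generator machinery, plus a reuse guard.
class autocast_cls:
    __slots__ = ("_state",)

    def __init__(self):
        self._state = None

    def __enter__(self):
        if self._state is not None:
            raise RuntimeError("autocast context manager is already active")
        self._state = "saved FP8 state"  # placeholder
        return self

    def __exit__(self, exc_type, exc, tb):
        self._state = None  # restore state; allows sequential reuse
        return False
```

Both variants work with the `with` statement, so callers see no behavioural change; only the per-entry CPU cost differs.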

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR reduces CPU overhead in the TE autocast path on Grace/GB200 systems via two optimizations: caching the string representation of recipe objects to avoid repeated formatting work, and replacing @contextmanager generator-based context managers with explicit __enter__/__exit__ class implementations to cut contextlib overhead.

  • Recipe base class gains a lazy _cached_repr field (invalidated on any __setattr__) and a __repr__ that rebuilds it via the new abstract _make_repr(); frozen dataclasses MMParams/QParams compute their repr once in __post_init__.
  • autocast is converted from a @contextmanager function to a class with __slots__, an explicit concurrent-reuse guard, and get_unique_autocast_key reads the already-cached repr directly from recipe.__dict__ to avoid an extra Python call.
  • fp8_autocast is simplified to a thin factory that returns an autocast instance, preserving the existing context-manager contract.
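The frozen-dataclass side of the caching described above can be sketched like this. It is an assumption-laden illustration, not the TE source: the field name is made up, but the pattern matches the summary, since a frozen dataclass's fields never change after construction, the repr can be computed once in __post_init__ (using object.__setattr__ to bypass the frozen-ness), with repr=False suppressing the dataclass-generated __repr__.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class MMParams:
    """Illustrative frozen params dataclass; field name is hypothetical."""

    use_split_accumulator: bool = True

    def __post_init__(self):
        # Frozen dataclasses forbid normal attribute assignment,
        # so bypass the generated __setattr__ with object.__setattr__.
        object.__setattr__(
            self,
            "_cached_repr",
            f"MMParams(use_split_accumulator={self.use_split_accumulator})",
        )

    def __repr__(self):
        # Eagerly-computed cache: no per-call string formatting.
        return self._cached_repr
```

Because the instance is immutable, there is no invalidation path to worry about, unlike the mutable Recipe base class.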

Confidence Score: 5/5

Safe to merge; all changed code paths are well-guarded and the optimizations are straightforward caching / class-conversion changes with no behavioural regressions.

The caching logic is correct: _cached_repr is invalidated on every __setattr__ for mutable recipes and computed eagerly in __post_init__ for frozen dataclasses. The autocast conversion preserves context-manager semantics and adds a correct concurrent-reuse guard. The only notable gap is that @abc.abstractmethod on Recipe._make_repr is effectively decorative without abc.ABC inheritance, but all existing subclasses already implement the method so there is no current breakage.

No files require special attention; transformer_engine/common/recipe/__init__.py has a minor design gap in the abstract method declaration that does not affect existing behaviour.

Important Files Changed

Filename Overview
transformer_engine/common/recipe/__init__.py Adds a lazy cached repr to the Recipe base class and __post_init__-based caching to the frozen dataclasses MMParams/QParams; all recipe subclasses' __repr__ is renamed to _make_repr and the dataclasses are annotated with repr=False. @abc.abstractmethod on _make_repr is a no-op because Recipe doesn't inherit from abc.ABC.
transformer_engine/pytorch/quantization.py Converts fp8_autocast to a thin factory returning the new class-based autocast context manager; adds __slots__, a concurrent-reuse guard (_fp8_state is not None), and switches get_unique_autocast_key from @classmethod to @staticmethod with an optimised cached-repr lookup.
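The "decorative abstractmethod" gap flagged above is easy to demonstrate in isolation: @abc.abstractmethod only prevents instantiation when the class uses the ABCMeta metaclass (e.g. by inheriting abc.ABC). The class names below are illustrative.

```python
import abc


class Plain:
    # Without abc.ABC (or metaclass=abc.ABCMeta), this decorator is
    # purely decorative: instantiation is NOT blocked.
    @abc.abstractmethod
    def _make_repr(self):
        ...


class Proper(abc.ABC):
    # With abc.ABC, instantiating a class that leaves _make_repr
    # unimplemented raises TypeError.
    @abc.abstractmethod
    def _make_repr(self):
        ...


Plain()  # works: the "abstract" method is not enforced
try:
    Proper()
except TypeError:
    pass  # abstract class cannot be instantiated
```

This is why the review notes no current breakage: every existing subclass already implements _make_repr, so enforcement was never exercised.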

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["fp8_autocast(...)"] -->|"returns autocast instance"| B["autocast.__init__"]
    B --> C["with autocast(...)"]
    C --> D["autocast.__enter__"]
    D --> E{"_fp8_state is not None?"}
    E -->|"Yes"| F["raise RuntimeError\n(concurrent reuse guard)"]
    E -->|"No"| G["check_recipe_support"]
    G --> H["get_autocast_state → _fp8_state"]
    H --> I["autocast_enter\n(set FP8 global state)"]
    I --> J["yield control to user code"]
    J --> K["autocast.__exit__"]
    K --> L["set_autocast_state(_fp8_state)"]
    L --> M["autocast_exit"]
    M --> N["_fp8_state = None\n(allows sequential reuse)"]

Reviews (4): Last reviewed commit: "Merge branch 'main' into cpu_opt_te_auto..."

Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/common/recipe/__init__.py Outdated
Comment thread transformer_engine/common/recipe/__init__.py Outdated
Comment thread transformer_engine/common/recipe/__init__.py
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
vthumbe1503 and others added 5 commits May 8, 2026 02:58
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci pytorch

@vthumbe1503 vthumbe1503 changed the title CPU overhead optimizations for te autocast [PyTorch] CPU overhead optimizations for te autocast May 12, 2026
Collaborator

@timmoon10 timmoon10 left a comment


Overall LGTM, with one simplification.

Comment on lines +593 to +597
recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
if recipe_repr is None:
    recipe_repr = str(recipe)
group_id = id(group) if group is not None else None
return f"recipe={recipe_repr},group={group_id}"
Collaborator

@timmoon10 timmoon10 May 12, 2026


We can simplify now that the recipe handles caching internally:

Suggested change
- recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
- if recipe_repr is None:
-     recipe_repr = str(recipe)
- group_id = id(group) if group is not None else None
- return f"recipe={recipe_repr},group={group_id}"
+ group_id = id(group) if group is not None else None
+ return f"recipe=({recipe}),group={group_id}"
