
[PyTorch] CPU overhead optimizations for te autocast#2957

Open
vthumbe1503 wants to merge 6 commits into NVIDIA:main from vthumbe1503:cpu_opt_te_autocast

Conversation

@vthumbe1503
Collaborator

@vthumbe1503 vthumbe1503 commented May 4, 2026

Description

te autocast has quite a bit of CPU overhead on Grace systems.
Here are the results on GB200 after the optimizations:

  • Without optimizations
    [profiler screenshot]

  • Optimization 1: Cache the recipe's string representation for computing the unique autocast key. Directly accessing the cached representation of the recipe is also ~40 ns faster than going through str(recipe).
    [profiler screenshot]
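The cached-repr idea can be sketched as follows. This is an illustrative reimplementation, not the actual TE code; the class names mirror the PR but the bodies are made up. The key points are that any attribute write invalidates the cache via __setattr__, and __repr__ rebuilds it lazily through a subclass hook (called _make_repr here, as in the PR).

```python
class Recipe:
    """Illustrative base class with a lazily cached string representation."""

    def __repr__(self):
        # Rebuild the repr only if the cache was invalidated (or never built).
        cached = self.__dict__.get("_cached_repr")
        if cached is None:
            cached = self._make_repr()
            self.__dict__["_cached_repr"] = cached
        return cached

    def __setattr__(self, name, value):
        # Any attribute mutation invalidates the cached representation.
        super().__setattr__(name, value)
        if name != "_cached_repr":
            self.__dict__["_cached_repr"] = None

    def _make_repr(self):
        # Subclasses build their actual repr string here.
        raise NotImplementedError


class DelayedScaling(Recipe):
    """Hypothetical subclass; field name is illustrative only."""

    def __init__(self, margin=0):
        self.margin = margin

    def _make_repr(self):
        return f"DelayedScaling(margin={self.margin})"
```

Repeated calls to repr() then return the cached string directly, and a dictionary lookup on recipe.__dict__ can skip even the __repr__ call when the cache is warm.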

  • Optimization 2: Use explicit __enter__/__exit__ methods instead of contextlib.contextmanager.
    [profiler screenshot]
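A minimal sketch of the second optimization, under the assumption (matching the review summary) that the class-based version carries a concurrent-reuse guard. Every `with` over a @contextmanager-decorated function allocates a generator plus a _GeneratorContextManager wrapper; a plain class with __slots__ and explicit __enter__/__exit__ avoids that machinery. The state strings here are placeholders for the real FP8 global state.

```python
import contextlib


# Generator-based version: each `with` builds a generator and a wrapper object.
@contextlib.contextmanager
def autocast_gen():
    state = "saved FP8 state"  # placeholder for saving global state
    try:
        yield
    finally:
        state = None  # placeholder for restoring global state


# Class-based version: no generator machinery, plus a reuse guard.
class autocast_cls:
    __slots__ = ("_state",)

    def __init__(self):
        self._state = None

    def __enter__(self):
        if self._state is not None:
            raise RuntimeError("autocast context manager is already active")
        self._state = "saved FP8 state"  # placeholder
        return self

    def __exit__(self, exc_type, exc, tb):
        self._state = None  # restore state; allows sequential reuse
        return False
```

Both variants work with the `with` statement, so callers see no behavioural change; only the per-entry CPU cost differs.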

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR reduces CPU overhead in the TE autocast path on Grace/GB200 systems via two optimizations: caching the string representation of recipe objects to avoid repeated formatting work, and replacing @contextmanager generator-based context managers with explicit __enter__/__exit__ class implementations to cut contextlib overhead.

  • Recipe base class gains a lazy _cached_repr field (invalidated on any __setattr__) and a __repr__ that rebuilds it via the new abstract _make_repr(); frozen dataclasses MMParams/QParams compute their repr once in __post_init__.
  • autocast is converted from a @contextmanager function to a class with __slots__, an explicit concurrent-reuse guard, and get_unique_autocast_key reads the already-cached repr directly from recipe.__dict__ to avoid an extra Python call.
  • fp8_autocast is simplified to a thin factory that returns an autocast instance, preserving the existing context-manager contract.
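The frozen-dataclass side of the caching described above can be sketched like this. It is an assumption-laden illustration, not the TE source: the field name is made up, but the pattern matches the summary, since a frozen dataclass's fields never change after construction, the repr can be computed once in __post_init__ (using object.__setattr__ to bypass the frozen-ness), with repr=False suppressing the dataclass-generated __repr__.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class MMParams:
    """Illustrative frozen params dataclass; field name is hypothetical."""

    use_split_accumulator: bool = True

    def __post_init__(self):
        # Frozen dataclasses forbid normal attribute assignment,
        # so bypass the generated __setattr__ with object.__setattr__.
        object.__setattr__(
            self,
            "_cached_repr",
            f"MMParams(use_split_accumulator={self.use_split_accumulator})",
        )

    def __repr__(self):
        # Eagerly-computed cache: no per-call string formatting.
        return self._cached_repr
```

Because the instance is immutable, there is no invalidation path to worry about, unlike the mutable Recipe base class.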

Confidence Score: 5/5

Safe to merge; all changed code paths are well-guarded and the optimizations are straightforward caching / class-conversion changes with no behavioural regressions.

The caching logic is correct: _cached_repr is invalidated on every __setattr__ for mutable recipes and computed eagerly in __post_init__ for frozen dataclasses. The autocast conversion preserves context-manager semantics and adds a correct concurrent-reuse guard. The only notable gap is that @abc.abstractmethod on Recipe._make_repr is effectively decorative without abc.ABC inheritance, but all existing subclasses already implement the method so there is no current breakage.

No files require special attention; transformer_engine/common/recipe/__init__.py has a minor design gap in the abstract method declaration that does not affect existing behaviour.

Important Files Changed

Filename Overview
transformer_engine/common/recipe/__init__.py Adds a lazy cached repr to the Recipe base class and __post_init__-based caching to the frozen dataclasses MMParams/QParams; all recipe subclasses' __repr__ is renamed to _make_repr and the dataclasses are annotated with repr=False. @abc.abstractmethod on _make_repr is a no-op because Recipe doesn't inherit from abc.ABC.
transformer_engine/pytorch/quantization.py Converts fp8_autocast to a thin factory returning the new class-based autocast context manager; adds __slots__, a concurrent-reuse guard (_fp8_state is not None), and switches get_unique_autocast_key from @classmethod to @staticmethod with an optimised cached-repr lookup.
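The "decorative abstractmethod" gap flagged above is easy to demonstrate in isolation: @abc.abstractmethod only prevents instantiation when the class uses the ABCMeta metaclass (e.g. by inheriting abc.ABC). The class names below are illustrative.

```python
import abc


class Plain:
    # Without abc.ABC (or metaclass=abc.ABCMeta), this decorator is
    # purely decorative: instantiation is NOT blocked.
    @abc.abstractmethod
    def _make_repr(self):
        ...


class Proper(abc.ABC):
    # With abc.ABC, instantiating a class that leaves _make_repr
    # unimplemented raises TypeError.
    @abc.abstractmethod
    def _make_repr(self):
        ...


Plain()  # works: the "abstract" method is not enforced
try:
    Proper()
except TypeError:
    pass  # abstract class cannot be instantiated
```

This is why the review notes no current breakage: every existing subclass already implements _make_repr, so enforcement was never exercised.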

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["fp8_autocast(...)"] -->|"returns autocast instance"| B["autocast.__init__"]
    B --> C["with autocast(...)"]
    C --> D["autocast.__enter__"]
    D --> E{"_fp8_state is not None?"}
    E -->|"Yes"| F["raise RuntimeError\n(concurrent reuse guard)"]
    E -->|"No"| G["check_recipe_support"]
    G --> H["get_autocast_state → _fp8_state"]
    H --> I["autocast_enter\n(set FP8 global state)"]
    I --> J["yield control to user code"]
    J --> K["autocast.__exit__"]
    K --> L["set_autocast_state(_fp8_state)"]
    L --> M["autocast_exit"]
    M --> N["_fp8_state = None\n(allows sequential reuse)"]

Reviews (4): Last reviewed commit: "Merge branch 'main' into cpu_opt_te_auto..."

Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/common/recipe/__init__.py Outdated
Comment thread transformer_engine/common/recipe/__init__.py Outdated
Comment thread transformer_engine/common/recipe/__init__.py
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
Comment thread transformer_engine/pytorch/quantization.py Outdated
vthumbe1503 and others added 5 commits May 8, 2026 02:58
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci pytorch

@vthumbe1503 vthumbe1503 changed the title CPU overhead optimizations for te autocast [PyTorch] CPU overhead optimizations for te autocast May 12, 2026
Collaborator

@timmoon10 timmoon10 left a comment


Overall LGTM, with one simplification.

Comment on lines +593 to +597
recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
if recipe_repr is None:
    recipe_repr = str(recipe)
group_id = id(group) if group is not None else None
return f"recipe={recipe_repr},group={group_id}"
Collaborator

@timmoon10 timmoon10 May 12, 2026


We can simplify now that the recipe handles caching internally:

Suggested change
- recipe_repr = recipe.__dict__.get("_cached_repr") if recipe is not None else None
- if recipe_repr is None:
-     recipe_repr = str(recipe)
- group_id = id(group) if group is not None else None
- return f"recipe={recipe_repr},group={group_id}"
+ group_id = id(group) if group is not None else None
+ return f"recipe=({recipe}),group={group_id}"
