
fix(tracing): fail open temporal span activities #437

Merged: danielmillerp merged 1 commit into next from dm/temporal-tracing-fail-open on Jun 23, 2026

Conversation

@danielmillerp commented Jun 23, 2026

Summary

  • make Temporal adk.tracing.start_span fail open and return None when the tracing activity fails
  • make Temporal adk.tracing.end_span fail open and return the original span when the tracing activity fails
  • add regression coverage for Temporal tracing activity timeout/failure behavior
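
As a rough illustration of the contract described above, here is a minimal sketch of the two fail-open return values. It is not the code from this PR: `TracingActivityError`, `Span`, and the `*_fail_open` helpers are hypothetical stand-ins so the example runs without Temporal installed.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


class TracingActivityError(Exception):
    """Hypothetical stand-in for a failed or timed-out tracing activity."""


@dataclass
class Span:
    """Hypothetical minimal span; the real model has more fields."""
    name: str
    output: Any = None


async def start_span_fail_open(
    run_activity: Callable[[], Awaitable[Span]],
) -> Optional[Span]:
    """Return the started span, or None if the tracing activity fails."""
    try:
        return await run_activity()
    except TracingActivityError:
        return None  # fail open: the workflow continues without a span


async def end_span_fail_open(
    span: Span,
    run_activity: Callable[[Span], Awaitable[Span]],
) -> Span:
    """Return the ended span, or the original span if the activity fails."""
    try:
        return await run_activity(span)
    except TracingActivityError:
        return span  # fail open: hand back what the caller already had


async def _failing_start() -> Span:
    raise TracingActivityError("activity timed out")


async def _failing_end(span: Span) -> Span:
    raise TracingActivityError("activity timed out")


async def demo():
    started = await start_span_fail_open(_failing_start)
    original = Span(name="step")
    ended = await end_span_fail_open(original, _failing_end)
    return started, ended, original
```

Running `asyncio.run(demo())` with both activities failing yields `None` for the start call and the untouched original span for the end call, so callers need to handle a missing span but never a raised tracing error.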

Tests

  • rye run pytest tests/lib/adk/test_tracing_module.py -n 0 -p no:cacheprovider
  • rye run ruff check --no-cache src/agentex/lib/adk/_modules/tracing.py tests/lib/adk/test_tracing_module.py

Greptile Summary

This PR makes the Temporal-path start_span and end_span tracing activities fail open rather than crashing the workflow when they time out or error. It also fixes StateMachine.reset_to_initial_state to guard the span usage against the newly-possible None return, matching the pattern already present in step().

  • start_span now catches ActivityError/TemporalTimeoutError (re-raising cancellations) and returns None; end_span returns the original span unchanged on the same errors; both emit a bounded agentex.tracing.temporal_span_activity.dropped metric.
  • reset_to_initial_state initializes span = None and gates the span.output assignment and end_span call behind span is not None, preventing an AttributeError when tracing fails open during a reset.
  • Six new unit tests cover the temporal fail-open path, propagation of unexpected errors, propagation of cancellation signals, the context-manager short-circuit, and an integration-style state_machine regression test.
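
The three outcomes the tests distinguish (swallowed activity failure, re-raised cancellation, propagated unexpected error) can be sketched as follows. This is an assumption-laden illustration: `ActivityFailure`, its `cancelled` flag, and `is_cancellation` are hypothetical stand-ins for the Temporal exception types and cancellation check named in the summary.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class ActivityFailure(Exception):
    """Hypothetical stand-in for an activity error raised inside a workflow."""

    def __init__(self, message: str, cancelled: bool = False) -> None:
        super().__init__(message)
        self.cancelled = cancelled


def is_cancellation(exc: BaseException) -> bool:
    """Hypothetical check: a failure flagged as caused by cancellation."""
    return bool(getattr(exc, "cancelled", False))


async def fail_open(
    call: Callable[[], Awaitable[T]], fallback: Optional[T]
) -> Optional[T]:
    """Swallow activity failures, but never cancellations or other errors."""
    try:
        return await call()
    except ActivityFailure as exc:
        if is_cancellation(exc):
            raise  # cancellation must reach the workflow
        return fallback  # ordinary failure or timeout: fail open
    # Any other exception type is not caught here and propagates unchanged.


async def _raise(exc: Exception):
    raise exc


def outcome(exc: Exception):
    """Return the fallback value, or the name of the exception that escaped."""
    try:
        return asyncio.run(fail_open(lambda: _raise(exc), fallback=None))
    except Exception as caught:
        return type(caught).__name__
```

With these stand-ins, `outcome(ActivityFailure("timeout"))` returns the fallback, while a cancelled failure and a `ValueError` both escape to the caller.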

Confidence Score: 5/5

Safe to merge — the fail-open logic is tightly scoped to Temporal activity errors, cancellations re-propagate correctly, and all direct callers of start_span now guard the None return before using the span.

The change is a focused defensive fix with no alterations to the happy path. The null-return contract is consistently handled across every direct caller (step() already guarded, reset_to_initial_state is fixed here, the span context manager uses if span:). Cancellation propagation is explicitly preserved. New tests cover the key branches including the regression path in state_machine.
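
The context-manager short-circuit mentioned above might look roughly like this. `FakeTracing` and its dict-based spans are hypothetical; the only detail taken from the review is that the context manager checks `if span:` before ending the span.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List, Optional


class FakeTracing:
    """Hypothetical tracer whose start_span can fail open by returning None."""

    def __init__(self, start_succeeds: bool) -> None:
        self.start_succeeds = start_succeeds
        self.ended: List[dict] = []

    async def start_span(self, name: str) -> Optional[dict]:
        if not self.start_succeeds:
            return None  # simulates the fail-open path
        return {"name": name, "output": None}

    async def end_span(self, span: dict) -> dict:
        self.ended.append(span)
        return span

    @asynccontextmanager
    async def span(self, name: str) -> AsyncIterator[Optional[dict]]:
        span = await self.start_span(name)
        try:
            yield span
        finally:
            if span:  # skip end_span when start_span failed open
                await self.end_span(span)


async def run(start_succeeds: bool) -> int:
    tracing = FakeTracing(start_succeeds)
    async with tracing.span("step") as span:
        if span is not None:  # callers also guard before mutating
            span["output"] = "done"
    return len(tracing.ended)
```

Here `run(True)` ends exactly one span and `run(False)` ends none, while the body of the `async with` block executes in both cases.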

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/adk/_modules/tracing.py | Adds fail-open exception handling for Temporal activity errors in start_span and end_span, with cancellation re-propagation and a bounded dropped-metric counter. |
| src/agentex/lib/sdk/state_machine/state_machine.py | Fixes reset_to_initial_state to guard span usage with span is not None, matching the existing pattern in step(); no logic changes beyond the null guard. |
| tests/lib/adk/test_tracing_module.py | Adds six new tests covering start/end fail-open, context-manager short-circuit, unexpected-error propagation, and cancellation propagation for both start and end paths. |
| tests/lib/test_state_machine.py | New regression test confirming reset_to_initial_state completes successfully and skips end_span when start_span returns None. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[start_span / end_span called] --> B{in_temporal_workflow?}
    B -- No --> C[Call _tracing_service directly\nraises on error]
    B -- Yes --> D[execute_activity via ActivityHelpers]
    D -- Success --> E[Return Span]
    D -- ActivityError / TemporalTimeoutError --> F{is_cancelled_exception?}
    F -- Yes --> G[Re-raise — workflow cancellation\nmust propagate]
    F -- No --> H[workflow.logger.warning + metric]
    H --> I{start_span or end_span?}
    I -- start_span --> J[Return None\nfail open]
    I -- end_span --> K[Return original span\nfail open]
    J --> L[Callers guard span is not None\nbefore mutating or calling end_span]
    K --> M[Caller receives original span\nworkflow continues]

Reviews (2): Last reviewed commit: "fix(tracing): fail open temporal span ac..."

Comment thread src/agentex/lib/adk/_modules/tracing.py
@danielmillerp force-pushed the dm/temporal-tracing-fail-open branch 2 times, most recently from fe7a153 to fa5d4a9 (June 23, 2026 17:17)
@danielmillerp changed the base branch from main to next (June 23, 2026 17:22)
@danielmillerp

Addressed the Greptile nullable-span finding in fa5d4a9.

The issue was that adk.tracing.start_span(...) can now fail open and return None, but StateMachine.reset_to_initial_state() was still mutating span.output unconditionally. Updated reset to mirror step(): initialize span = None and only set output / call end_span when a span was actually created.

Added regression coverage in tests/lib/test_state_machine.py for the reset path where start_span returns None.
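
For readers following along, the guard described in this comment can be sketched as below. `MiniStateMachine`, `DroppedTracing`, the `trace_transitions` flag, and the span name are hypothetical simplifications, not the actual `StateMachine` code.

```python
import asyncio
from typing import Any, Optional


class Span:
    """Hypothetical minimal span."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.output: Any = None


class DroppedTracing:
    """Hypothetical tracer whose start_span always fails open."""

    def __init__(self) -> None:
        self.end_calls = 0

    async def start_span(self, name: str) -> Optional[Span]:
        return None

    async def end_span(self, span: Span) -> Span:
        self.end_calls += 1
        return span


class MiniStateMachine:
    def __init__(self, tracing, initial_state: str, trace_transitions: bool = True):
        self.tracing = tracing
        self.initial_state = initial_state
        self.state = initial_state
        self.trace_transitions = trace_transitions

    async def reset_to_initial_state(self) -> None:
        span = None  # stays None if tracing is off or start_span fails open
        if self.trace_transitions:
            span = await self.tracing.start_span("state_machine_reset")
        self.state = self.initial_state
        if span is not None:  # only touch the span when one was created
            span.output = {"new_state": self.state}
            await self.tracing.end_span(span)
```

Without the `span is not None` guard, the `span.output` assignment would raise `AttributeError` on `None`; with it, the reset completes and `end_span` is never called.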

@danielmillerp force-pushed the dm/temporal-tracing-fail-open branch from fa5d4a9 to c88b9eb (June 23, 2026 17:25)
@danielmillerp force-pushed the dm/temporal-tracing-fail-open branch from c88b9eb to e90a03e (June 23, 2026 17:32)
@danielmillerp

Added the span-drop metric nit in e90a03e.

On Temporal tracing activity fail-open, we now emit replay-safe workflow metrics via workflow.metric_meter():

agentex.tracing.temporal_span_activity.dropped

with event_type=start or event_type=end. Cancellation still propagates and is not counted as a dropped span.
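
A minimal sketch of how the counter stays bounded, assuming an in-memory stand-in rather than the real meter returned by workflow.metric_meter(). `InMemoryCounter` and `record_dropped_span` are hypothetical; only the metric name and the two `event_type` values come from this PR.

```python
from collections import Counter
from typing import Dict

DROPPED_METRIC = "agentex.tracing.temporal_span_activity.dropped"
# Restricting the attribute to two values keeps metric cardinality bounded.
ALLOWED_EVENT_TYPES = frozenset({"start", "end"})


class InMemoryCounter:
    """Hypothetical stand-in for a metric counter, keyed by attribute set."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.values: Counter = Counter()

    def add(self, value: int, attributes: Dict[str, str]) -> None:
        key = tuple(sorted(attributes.items()))
        self.values[key] += value


def record_dropped_span(counter: InMemoryCounter, event_type: str) -> None:
    """Count one dropped span; reject attribute values outside the allowed set."""
    if event_type not in ALLOWED_EVENT_TYPES:
        raise ValueError(f"unexpected event_type: {event_type!r}")
    counter.add(1, {"event_type": event_type})


counter = InMemoryCounter(DROPPED_METRIC)
record_dropped_span(counter, "start")
record_dropped_span(counter, "start")
record_dropped_span(counter, "end")
```

In this sketch the cancellation path would simply never call `record_dropped_span`, which matches the statement that cancellations are not counted as dropped spans.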

@danielmillerp merged commit 2d63eef into next (Jun 23, 2026); 64 checks passed
@danielmillerp deleted the dm/temporal-tracing-fail-open branch (June 23, 2026 17:46)
@stainless-app (bot) mentioned this pull request (Jun 23, 2026)