fix(tracing): fail open temporal span activities#437
Conversation
fe7a153 to
fa5d4a9
Compare
|
Addressed the Greptile nullable-span finding in fa5d4a9. The issue was that adk.tracing.start_span(...) can now fail open and return None, but StateMachine.reset_to_initial_state() was still mutating span.output unconditionally. Updated reset to mirror step(): initialize span = None and only set output / call end_span when a span was actually created. Added regression coverage in tests/lib/test_state_machine.py for the reset path where start_span returns None. |
fa5d4a9 to
c88b9eb
Compare
c88b9eb to
e90a03e
Compare
|
Added the span-drop metric nit in e90a03e. On Temporal tracing activity fail-open, we now emit replay-safe workflow metrics via workflow.metric_meter(): agentex.tracing.temporal_span_activity.dropped with event_type=start or event_type=end. Cancellation still propagates and is not counted as a dropped span. |
Summary
Tests
Greptile Summary
This PR makes the Temporal-path
start_spanandend_spantracing activities fail open rather than crashing the workflow when they time out or error. It also fixesStateMachine.reset_to_initial_stateto guard the span usage against the newly-possibleNonereturn, matching the pattern already present instep().start_spannow catchesActivityError/TemporalTimeoutError(re-raising cancellations) and returnsNone;end_spanreturns the original span unchanged on the same errors; both emit a boundedagentex.tracing.temporal_span_activity.droppedmetric.reset_to_initial_stateinitializesspan = Noneand gates thespan.outputassignment andend_spancall behindspan is not None, preventing anAttributeErrorwhen tracing fails open during a reset.state_machineregression test.Confidence Score: 5/5
Safe to merge — the fail-open logic is tightly scoped to Temporal activity errors, cancellations re-propagate correctly, and all direct callers of start_span now guard the None return before using the span.
The change is a focused defensive fix with no alterations to the happy path. The null-return contract is consistently handled across every direct caller (step() already guarded, reset_to_initial_state is fixed here, the span context manager uses if span:). Cancellation propagation is explicitly preserved. New tests cover the key branches including the regression path in state_machine.
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[start_span / end_span called] --> B{in_temporal_workflow?} B -- No --> C[Call _tracing_service directly\nraises on error] B -- Yes --> D[execute_activity via ActivityHelpers] D -- Success --> E[Return Span] D -- ActivityError / TemporalTimeoutError --> F{is_cancelled_exception?} F -- Yes --> G[Re-raise — workflow cancellation\nmust propagate] F -- No --> H[workflow.logger.warning + metric] H --> I{start_span or end_span?} I -- start_span --> J[Return None\nfail open] I -- end_span --> K[Return original span\nfail open] J --> L[Callers guard span is not None\nbefore mutating or calling end_span] K --> M[Caller receives original span\nworkflow continues]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[start_span / end_span called] --> B{in_temporal_workflow?} B -- No --> C[Call _tracing_service directly\nraises on error] B -- Yes --> D[execute_activity via ActivityHelpers] D -- Success --> E[Return Span] D -- ActivityError / TemporalTimeoutError --> F{is_cancelled_exception?} F -- Yes --> G[Re-raise — workflow cancellation\nmust propagate] F -- No --> H[workflow.logger.warning + metric] H --> I{start_span or end_span?} I -- start_span --> J[Return None\nfail open] I -- end_span --> K[Return original span\nfail open] J --> L[Callers guard span is not None\nbefore mutating or calling end_span] K --> M[Caller receives original span\nworkflow continues]Reviews (2): Last reviewed commit: "fix(tracing): fail open temporal span ac..." | Re-trigger Greptile