feat(cloud-agent): classify retryable runtime failures#3825
Conversation
Code Review SummaryStatus: No Issues Found | Recommendation: Merge Executive SummaryFull incremental review of the three-commit feature ( Key Design Points (verified correct)
Files Reviewed (15 files — incremental diff)
Reviewed by claude-4.6-sonnet-20260217 · 6,789,679 tokens Review guidance: REVIEW.md from base branch |
8dff2d2 to
aadf64e
Compare
| | 'assistant_error' | ||
| | 'wrapper_error_after_activity' | ||
| | 'missing_assistant_reply' | ||
| | 'payment_required' |
There was a problem hiding this comment.
@kilocode-bot Why was this removed? What replaced this?
There was a problem hiding this comment.
Short answer: 'payment_required' was added, not removed. The + in the diff hunk means this is a new member being inserted into the union — nothing above or below it was removed.
Why it was added
The DB enum is bound to the runtime types via a compile-time contract in session-message-state.ts:
type CodeContractMatchesReport = AssertTrue<
CloudAgentRunFailureCode extends CloudAgentFailureCode ? true : false
>;This PR adds payment_required in lockstep to all layers — packages/db/src/schema.ts (here), packages/worker-utils/src/cloud-agent-failure.ts, and safe-failure-projection.ts. If it were added to CLOUD_AGENT_FAILURE_CODES but not to the DB type, the AssertTrue contract would fail to compile. The schema addition is required for the Durable Object to carry a typed failureCode: 'payment_required' at all.
What it does
Previously, insufficient-credits failures were flattened into assistant_error (indistinguishable from any generic error). This PR introduces a typed wrapper signal (WrapperTerminalFailureCodes = ['payment_required', 'model_missing']) so the wrapper can now communicate the precise failure, which terminal-error-projector.ts classifies as non-retryable — feeding the new clientError: { code, message, retryable } contract on callbacks and getMessageResult.
The subtlety
payment_required is intentionally not persisted to the DB column. In telemetry/queue-reports.ts, persistedFailureCode() normalizes it away before writing to failure_code:
function persistedFailureCode(state) {
if (state.failureCode !== 'model_missing' && state.failureCode !== 'payment_required')
return state.failureCode;
if (state.failureStage === 'agent_activity') return 'assistant_error';
if (state.failureStage === 'post_dispatch_no_activity') return 'wrapper_error_before_activity';
return state.failureCode;
}So if the concern is "this enum value will never actually be written to the DB column" — that's accurate and intentional by design. The value exists in CloudAgentSessionRunFailureCode solely to satisfy the shared contract the DO runtime depends on, not because the column is expected to store it.
Summary
{ code, message, retryable }client-error contract across Cloud Agent tRPC responses, terminal callbacks, and message-result polling.payment_requiredandmodel_missingfailures through wrapper ingest and expose optionalfailureStageso callback clients can choose their recovery strategy.Verification
Visual Changes
N/A
Reviewer Notes
retryable: truepermits a new logical attempt with a newmessageId; it does not automatically replay work or make a terminal message ID reusable.agent_activitymay have partial side effects. The callback'sfailureStagelets clients choose between replaying, continuing, inspecting state, or requesting user input.