Skip to content

feat(cloud-agent): classify retryable runtime failures#3825

Open
eshurakov wants to merge 3 commits into
mainfrom
boiled-timbale
Open

feat(cloud-agent): classify retryable runtime failures#3825
eshurakov wants to merge 3 commits into
mainfrom
boiled-timbale

Conversation

@eshurakov

@eshurakov eshurakov commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add a shared { code, message, retryable } client-error contract across Cloud Agent tRPC responses, terminal callbacks, and message-result polling.
  • Treat agent, provider, wrapper, network, and timeout failures as retryable regardless of lifecycle stage, while keeping deterministic model, payment, input, metadata, and user-interruption failures non-retryable.
  • Preserve typed payment_required and model_missing failures through wrapper ingest and expose optional failureStage so callback clients can choose their recovery strategy.
  • Keep Code Review, Security Auto Analysis, Auto Fix, bot, and webhook consumer policies unchanged; the new callback fields remain additive and can be adopted separately.

Verification

  • Manual verification not performed; this is a backend-only contract and classification change.

Visual Changes

N/A

Reviewer Notes

  • retryable: true permits a new logical attempt with a new messageId; it does not automatically replay work or make a terminal message ID reusable.
  • Failures after agent_activity may have partial side effects. The callback's failureStage lets clients choose between replaying, continuing, inspecting state, or requesting user input.
  • Database telemetry keeps its existing stage/code vocabulary: typed payment/model failures are normalized only for persistence reports, while callbacks and message results retain the exact classification.

@kilo-code-bot

kilo-code-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Full incremental review of the three-commit feature (feat(cloud-agent): expose structured retryable errors, fix(cloud-agent): preserve classified assistant failures, fix(cloud-agent): preserve consumer compatibility). The PR correctly classifies payment_required and model_missing as structured failure codes throughout the stack — from wrapper event detection through DO state storage, callback payload, and DB persistence — with no logic errors, security concerns, or runtime hazards found.

Key Design Points (verified correct)
  • DO SQLite state stores payment_required / model_missing as failureCode in agent_activity or post_dispatch_no_activity stages — intentionally broader than what the PG DB allows.
  • persistedFailureCode() translation in queue-reports.ts maps these codes to assistant_error / wrapper_error_before_activity respectively before the queue message is emitted, keeping the DB constraint (cloud_agent_session_runs_failure_classification_check) satisfied without any migration.
  • DB constraint is unchanged from main (migration 0163 that dropped it was reverted in the fix commit; schema.ts now explicitly mirrors the existing constraint to prevent Drizzle drift).
  • projectTerminalClientError() correctly marks payment_required and model_missing as retryable: false in client error responses.
  • Consumer schema changes (webhook-agent-ingest, security-auto-analysis, code-review-status route) correctly remove the now-unused clientError / failureStage fields that were previously parsed from callback payloads.
  • trpc-error.ts rewrite cleanly replaces the old unreachable type guard (previously flagged) with a well-tested projectTrpcErrorData + createClientError approach.
  • Previous inline warnings on wrapper-supervisor.ts:984, trpc-error.ts, code-review-status route, and security-auto-analysis callbacks are all resolved by this PR.
Files Reviewed (15 files — incremental diff)
  • packages/db/src/schema.ts — adds payment_required to CloudAgentSessionRunFailureCode type; restores explicit check constraint mirroring the existing DB constraint
  • packages/worker-utils/src/client-error.ts — new ClientError / PublicErrorCodeSchema wire contract
  • packages/worker-utils/src/client-error.test.ts — new tests
  • packages/worker-utils/src/cloud-agent-failure.ts — adds payment_required to CLOUD_AGENT_FAILURE_CODES
  • packages/worker-utils/src/cloud-agent-queue-report.ts — removes payment_required/model_missing from CloudAgentRunFailureClassifications
  • services/cloud-agent-next/src/callbacks/types.ts — adds failureStage and clientError to ExecutionCallbackPayload
  • services/cloud-agent-next/src/persistence/CloudAgentSession.ts — adds clientError to legacy callback payload
  • services/cloud-agent-next/src/session/terminal-error-projector.ts — new module; projectTerminalClientError and isTerminalFailureRetryable
  • services/cloud-agent-next/src/session/message-settlement-outbox.ts — adds failureStage and clientError to settlement callback payload
  • services/cloud-agent-next/src/session/message-result.ts — adds retryable to SafeMessageResult['failure']
  • services/cloud-agent-next/src/session/safe-failure-projection.ts — adds payment_required generic message
  • services/cloud-agent-next/src/session/wrapper-supervisor.ts — preserves terminalFailureCode in both assistant and non-assistant failure branches
  • services/cloud-agent-next/src/telemetry/queue-reports.tspersistedFailureCode() maps payment_required/model_missing before DB write
  • services/cloud-agent-next/src/trpc-error.ts — full rewrite adding createClientError / projectTrpcErrorData
  • services/cloud-agent-next/src/websocket/ingest.ts — forwards failureCode from stream error events to the supervisor
  • services/cloud-agent-next/wrapper/src/connection.tsgetTerminalFailure() replaces isTerminalError(), returning structured failure with code
  • services/cloud-agent-next/wrapper/src/main.ts — sends failureCode in ingest stream message
  • services/auto-triage-infra/src/types.ts — removes clientError / failureStage from callback schema
  • services/security-auto-analysis/src/callbacks.ts — removes clientError from failure mapping
  • services/webhook-agent-ingest/src/routes/callbacks.ts — removes clientError / failureStage from callback schema
  • apps/web/src/app/api/internal/code-review-status/[reviewId]/route.ts — removes clientError from callback type and normalizer
  • Various test files — comprehensive coverage for all new paths

Reviewed by claude-4.6-sonnet-20260217 · 6,789,679 tokens

Review guidance: REVIEW.md from base branch main

Comment thread services/cloud-agent-next/src/trpc-error.ts Outdated
@eshurakov eshurakov changed the title feat(cloud-agent): expose structured retryable errors feat(cloud-agent): classify retryable runtime failures Jun 9, 2026
Comment thread apps/web/src/app/api/internal/code-review-status/[reviewId]/route.ts Outdated
Comment thread services/security-auto-analysis/src/callbacks.ts Outdated
Comment thread services/cloud-agent-next/src/session/wrapper-supervisor.ts
@eshurakov eshurakov requested a review from alex-alecu June 11, 2026 14:45
Comment thread packages/db/src/schema.ts
| 'assistant_error'
| 'wrapper_error_after_activity'
| 'missing_assistant_reply'
| 'payment_required'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kilocode-bot Why was this removed? What replaced this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer: 'payment_required' was added, not removed. The + in the diff hunk means this is a new member being inserted into the union — nothing above or below it was removed.

Why it was added

The DB enum is bound to the runtime types via a compile-time contract in session-message-state.ts:

type CodeContractMatchesReport = AssertTrue<
  CloudAgentRunFailureCode extends CloudAgentFailureCode ? true : false
>;

This PR adds payment_required in lockstep to all layers — packages/db/src/schema.ts (here), packages/worker-utils/src/cloud-agent-failure.ts, and safe-failure-projection.ts. If it were added to CLOUD_AGENT_FAILURE_CODES but not to the DB type, the AssertTrue contract would fail to compile. The schema addition is required for the Durable Object to carry a typed failureCode: 'payment_required' at all.

What it does

Previously, insufficient-credits failures were flattened into assistant_error (indistinguishable from any generic error). This PR introduces a typed wrapper signal (WrapperTerminalFailureCodes = ['payment_required', 'model_missing']) so the wrapper can now communicate the precise failure, which terminal-error-projector.ts classifies as non-retryable — feeding the new clientError: { code, message, retryable } contract on callbacks and getMessageResult.

The subtlety

payment_required is intentionally not persisted to the DB column. In telemetry/queue-reports.ts, persistedFailureCode() normalizes it away before writing to failure_code:

function persistedFailureCode(state) {
  if (state.failureCode !== 'model_missing' && state.failureCode !== 'payment_required')
    return state.failureCode;
  if (state.failureStage === 'agent_activity') return 'assistant_error';
  if (state.failureStage === 'post_dispatch_no_activity') return 'wrapper_error_before_activity';
  return state.failureCode;
}

So if the concern is "this enum value will never actually be written to the DB column" — that's accurate and intentional by design. The value exists in CloudAgentSessionRunFailureCode solely to satisfy the shared contract the DO runtime depends on, not because the column is expected to store it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants