database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent by elovelan · Pull Request #2 · stevemsmith/base

elovelan · 2026-06-29T18:07:36Z

Suggested implementation of the transaction-retry pattern from the review comments on this PR:

r3470551946 — recommending an explicit transaction (with fn taking a transaction argument like pgx.BeginTxFunc) instead of an implicit one, so there are fewer failure scenarios where a retry could duplicate committed work. Also: capture pg_current_xact_id() before commit and consult pg_xact_status() on failure to see if the commit was successful.
r3487236671 — recommending two distinct functions: one with automatic safe retries (transactional), and one for idempotent callers where the function being retried must be idempotent.
r3482973525 / r3483137216 — network errors and string-matched connection errors should be removed from the standalone classifier because without a transaction there's no way to know if the query landed.

What this adds

RetryPostgresNonIdempotent (safe, transactional) — wraps fn in an explicit BeginTx/Commit/Rollback and retries the whole transaction on transient errors. fn receives a *sql.Tx each attempt. This is the shape of pgx.BeginTxFunc and pgxtransactions.Transaction, and what jackc agreed should work. Use this for non-idempotent operations. Commit verification is always on (see below).

RetryIdempotent (generic, idempotent-caller) — retries fn (no transaction wrapper) on any error by default except context.Canceled/context.DeadlineExceeded. The caller vouches that fn is idempotent. Lighter-weight option for selects and idempotent DML (no transaction overhead, no commit-verification round-trip).

RetryMySQLNonIdempotent / RetrySpannerNonIdempotent (placeholders) — return errors.New("... not yet implemented") with TODOs. They share RetryNonIdempotentOptions and the fn func(*sql.Tx) error signature so the cross-backend API surface is visible.

The existing RetryPostgres is removed (this PR is unmerged). Callers should use RetryPostgresNonIdempotent (transactional) or RetryIdempotent (idempotent).

Pre-commit classifier (opt-out) + commit verification (always-on)

Pre-commit errors (from BeginTx or fn) — opt-OUT classifier isRetryablePostgresPreCommitError: pre-commit retry is always safe (the transaction rolls back), so it retries everything except context.Canceled/context.DeadlineExceeded and the permanent SQLSTATE classes (22xxx, 23xxx, 42xxx — via pgerrcode.IsDataException/IsIntegrityConstraintViolation/IsSyntaxErrororAccessRuleViolation). Unknown *pgconn.PgError codes and non-PgError errors are retried; a false positive (~200ms wasted) costs less than a false negative (spurious user-facing failure). opts.IsRetryable is additive (OR) on this classifier.

Commit errors (from (*sql.Tx).Commit) — always verified, not classified. The server is the authority on whether a commit landed (a client-side guess can't tell committed-but-response-lost from genuinely-aborted), so instead of a "retryable commit" classifier we always check pg_xact_status. Per r3470551946 and the PostgreSQL docs (which explicitly endorse pg_xact_status for "determine whether their transaction committed or aborted after the application and database server become disconnected while a COMMIT is in progress"):

Before each Commit, capture the transaction id: SELECT pg_current_xact_id_if_assigned()::text (NULL → read-only; the ::text cast sidesteps pgx's xid8 Go type-mapping).
On a commit error, query SELECT pg_xact_status($1::xid8) on a fresh connection (the original may be dead; xid8 is globally unique so this works across connections/reconnects, modulo a failover replay-lag window):
- 'aborted' → retry (the commit didn't land).
- 'committed' → ErrCommitted (don't retry — would duplicate; the caller reconciles/alerts).
- NULL / 'in progress' / query errors → ErrCommitPhase (ambiguous; caller decides) - unlikely, but could happen during an asynchronous regional failover or true regional outage.
Read-only transactions (no xid) retry without verifying — no writes to duplicate.

A retryable commit error (verified-aborted or read-only) on exhaustion returns the original error (not ErrCommitPhase — it's known non-commit, just out of retries). ErrCommitPhase is only for inconclusive commits.

if errors.Is(err, database.ErrCommitted) {
    // Verified committed: do NOT retry (would duplicate). Reconcile/alert.
}

if errors.Is(err, database.ErrCommitPhase) {
    // Inconclusive: the commit may or may not have landed. Decide / re-check.
}

Both preserve the underlying commit error for inspection via errors.Is/errors.As.

Why no commit classifier? An earlier iteration had a "maximally conservative" commit classifier (isSafeRetryablePostgresError) that fast-pathed pgconn.SafeToRetry + class-40 commit errors to retry without verification. But commit errors are rare, and the xid-capture round-trip is already paid on every commit — so always checking pg_xact_status on a commit error (one more round-trip, only on the rare commit error) is negligible, and it removes the whole client-side guessing exercise (the 57P01/57P02/57014/57P03/53300/network/driver.ErrBadConn exclusion reasoning). Class-40 → pg_xact_status says 'aborted' → retry (same outcome, +1 round-trip); the rare SafeToRetry+failover+replay-lag case → ErrCommitPhase (recoverable). opts.IsRetryable no longer applies to commit errors — safer, since a caller shouldn't override commit verification.

Trade-off: commit verification adds one round-trip per transaction (pg_current_xact_id_if_assigned() before every commit), paid always but only used on rare commit errors. Always-on for the non-idempotent variant (where a duplicate-commit hurts most); RetryIdempotent skips it (no transaction).

Non-configurable retry budget (`maxRetryAttempts = 3`)

MaxAttempts is not configurable; both loops use maxRetryAttempts = 3 (in order to match database/sql's maxBadConnRetries+1). Use the context deadline to bound total time, or wrap for more. database/sql already retries driver.ErrBadConn for *sql.DB methods (immediate, up to 3 attempts); this outer loop layers on top with jitter for longer outages. *sql.Tx methods have no internal retry, so this is the only retry for tx.Exec/tx.Commit.

`pgerrcode` adoption

Adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants, by the pgx author) to eliminate string literals and inline human-friendly-name comments: PostgresUniqueViolation/PostgresDeadlockFound use pgerrcode.UniqueViolation/pgerrcode.DeadlockDetected; isPermanentPostgresError uses pgerrcode.IsDataException/IsIntegrityConstraintViolation/IsSyntaxErrororAccessRuleViolation. Tests use pgerrcode constants throughout.

Naming

RetryIdempotent (no transaction wrapper) — for selects and idempotent DML; the caller vouches fn is idempotent. Chosen over "unsafe" so the common case doesn't put "unsafe" all over callers' codebases.
RetryPostgresNonIdempotent (transactional + commit verification) — for non-idempotent ops. The Tx suffix was dropped (the fn func(*sql.Tx) error parameter makes the transaction obvious at the call site).
RetryIdempotentOptions / RetryNonIdempotentOptions (shared by the Postgres/MySQL/Spanner non-idempotent variants) — parallel to the function names.
The internal retryTx helper keeps its name (unexported, descriptive).

Pluggable retryable cases

RetryNonIdempotentOptions.IsRetryable is additive (OR) on the pre-commit classifier (commit errors are always verified, not classified).
RetryIdempotentOptions.IsRetryable replaces the default (nil → retry any error except context cancellation/deadline).

Out of scope (follow-ups)

Implementing RetryMySQLNonIdempotent / RetrySpannerNonIdempotent — placeholders only (no pg_xact_status equivalent wired; they'd fall back to ErrCommitPhase for commit errors until each backend provides a verifier).
A sql.DB.RetryPostgresNonIdempotent method on the sql/db.go wrapper — deferred (would add a sql → database dependency).
Integration test against a real Postgres for the verification path (simulating an ambiguous commit) — the unit tests use a mock driver that scripts pg_current_xact_id_if_assigned/pg_xact_status returns; a real-PG integration test would confirm the SQL and xid8 round-tripping.

Test plan

TestRetryIdempotent — success, retries on any error by default (incl. 23505), custom IsRetryable short-circuits, no retry on context.Canceled, ctx cancellation, attempt exhaustion.
TestRetryPostgresNonIdempotent — success; fn-error-then-succeed (rollback between attempts); non-retryable fn error short-circuits (23505); driver.ErrBadConn from fn retried; SafeToRetry Begin failure retried; SafeToRetry Commit failure → verified-aborted → retry; verify-aborted → retry; verify-committed → ErrCommitted; verify-unknown (NULL) → ErrCommitPhase; verify-error → ErrCommitPhase; read-only (no xid) → retry without verifying; class-40 exhaustion → verify-aborted each attempt → original error (not ErrCommitPhase); 57P01+inconclusive → ErrCommitPhase; additive opts.IsRetryable (pre-commit); ctx cancellation; attempt exhaustion. Uses a minimal database/sql/driver mock (with QueryerContext to script the verification queries).
TestIsRetryablePostgresPreCommitError — pre-commit classifier leg coverage.
TestErrCommitPhase / TestErrCommitted — sentinel detection; underlying error preserved; distinct from each other.
TestRetryMySQLNonIdempotentNotImplemented / TestRetrySpannerNonIdempotentNotImplemented — placeholders return "not yet implemented".
go build ./..., go vet ./..., go test ./database/... -short all pass.

…etryUnsafe Implements the transaction-retry pattern suggested in PR moov-io#490 review comments (discussion_r3470551946 and discussion_r3487236671): - RetryPostgresTx wraps fn in an explicit transaction (fn receives a *sql.Tx) and retries the whole transaction on transient errors. This is the "safe" variant: the explicit transaction guarantees fn's work is atomic, so a retried attempt starts from a clean slate. - RetryUnsafe retries fn (no transaction wrapper) on any error by default; the caller vouches that fn is idempotent. This is the "unsafe" variant for cases where a transaction isn't feasible. - RetryMySQLTx and RetrySpannerTx are added as not-implemented placeholders so the cross-backend API surface is visible and the gap is enforced by tests. The safe variant's default classifier (isRetryablePostgresTxError) is the OR of three signals: 1. driver.ErrBadConn -- pgx's stdlib adapter converts pgconn.SafeToRetry-flagged Exec/Query errors to driver.ErrBadConn before fn sees them; this is the primary retry trigger inside a transaction. 2. pgconn.SafeToRetry -- covers Begin/Commit failures where the original pgx error flows through unchanged. 3. IsRetryablePostgresError -- server-returned SQLSTATE codes that guarantee rollback (40001, 40P01, 57P01, ...). RetryTxOptions.IsRetryable is additive on top of the default classifier so consumers can extend it with implementation-specific cases without narrowing the safety floor. RetryUnsafeOptions.IsRetryable replaces the default (caller takes full control). The existing RetryPostgres is removed (the PR is unmerged). A stacked follow-up will add pg_current_xact_id()/pg_xact_status() commit verification; a TODO marks the insertion point in retryTx. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without an explicit transaction there is no way to know whether a network error occurred before or after the server accepted the query, so retrying could duplicate committed work. IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that the server guarantees were rolled back before the error was returned (40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError, io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by peer"/"broken pipe"/"conn closed") checks are removed. The context.DeadlineExceeded check is removed as redundant (non-PgError errors already return false). The doc comment explains why network errors are excluded and points callers to RetryPostgresTx for transactional retry. The network-error checks move into a new internal isPostgresNetworkError helper, which is wired into isRetryablePostgresTxError as a 4th leg. Network errors ARE safe to retry within an explicit transaction (the server rolls back the uncommitted transaction), so RetryPostgresTx keeps them. String matching is dropped entirely -- driver.ErrBadConn and pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError / io checks cover post-send cases, so substring matching is redundant and fragile. Tests updated: - TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError, and the string-matched errors now assert false; SQLSTATE assertions unchanged. - TestIsRetryablePostgresTxError: added a Leg 4 section asserting io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the transactional classifier. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

…narrow) phases Addresses review feedback questioning the claim that commit-phase errors always roll back: a commit-phase error is ambiguous because the commit may have succeeded on the server before the error was returned to the client (e.g., the TCP connection was severed after the COMMIT message was sent but before the response was received). Retrying such an error could duplicate the committed work. Changes: - IsRetryablePostgresError -> isSafeRetryablePostgresError (unexported). It classified only SQLSTATE codes that the server guarantees were rolled back, and the recommended path for callers is now RetryPostgresTx or RetryUnsafe, not building a custom retry loop on top of this classifier. - isRetryablePostgresTxError -> isRetryablePostgresPreCommitError (broad, for errors from BeginTx or fn): driver.ErrBadConn + pgconn.SafeToRetry + isSafeRetryablePostgresError + isPostgresNetworkError. Any error from Begin or fn is safe to retry because the transaction rolls back, so network errors are included. Added a comment on the pgconn.SafeToRetry leg explaining pgx guarantees these always occur before any data is sent to the server. - New isRetryablePostgresCommitError (narrow, for errors from (*sql.Tx).Commit): driver.ErrBadConn + pgconn.SafeToRetry + isSafeRetryablePostgresError. Network errors are intentionally EXCLUDED because they could mean the commit succeeded but the response was lost. - New ErrCommitPhase sentinel: commit-phase errors that are not retried (e.g., a network error during commit) are wrapped with ErrCommitPhase so callers can detect them via errors.Is and decide whether to alert, reconcile, or check whether the commit landed (future pg_xact_status() work). The underlying error is preserved for inspection. - retryTx now takes a retryClassifier (preCommit + commit funcs) and runOneTxAttempt returns (err, commitPhase bool); the retry loop applies the commit classifier when commitPhase is true and wraps the final commit-phase error with ErrCommitPhase. opts.IsRetryable remains additive on top of BOTH phases. - TODO(stacked-pr) -> TODO(future) for the pg_current_xact_id()/ pg_xact_status() commit-verification hook. Tests: - Removed TestIsRetryablePostgresError from postgres_test.go (function is now private); SQLSTATE coverage moved to internal TestIsSafeRetryablePostgresError. - Renamed TestIsRetryablePostgresTxError -> TestIsRetryablePostgresPreCommitError. - Added TestIsRetryablePostgresCommitError (network NOT retryable at commit). - Added TestRetryPostgresTx subtests: commit-phase network error is not retried and is wrapped with ErrCommitPhase; commit-phase exhaustion wraps the final error with ErrCommitPhase. - Added TestErrCommitPhase (sentinel detection + underlying error preserved). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

isRetryablePostgresCommitError was a strict subset of isRetryablePostgresPreCommitError (pre-send + server-rolled-back, without the pre-commit-only network leg), duplicating the driver.ErrBadConn and pgconn.SafeToRetry checks that also belonged to the 'safe at any phase' bucket. Folded those two legs into isSafeRetryablePostgresError so it becomes the single 'safe to retry regardless of transaction phase' classifier (pre-send + server-rolled-back SQLSTATE codes), and assigned it directly to postgresTxClassifier.commit. isRetryablePostgresPreCommitError now delegates: isSafeRetryablePostgresError || isPostgresNetworkError. Behavior is unchanged; only the code structure is simpler. Also documented on isRetryablePostgresPreCommitError the permanent-error categories that are intentionally NOT retried pre-commit even though retrying them would be safe (the transaction rolls back): constraint violations (23xxx), syntax errors (42601), schema errors (42xxx), permission errors (42501), and data exceptions (22xxx). These surface at query time (not connection time) and would fail again on retry, so retrying only adds ~200ms of latency before the caller receives the error. Callers who want to retry a broader set can pass opts.IsRetryable. Tests: folded TestIsRetryablePostgresCommitError assertions into TestIsSafeRetryablePostgresError (added driver.ErrBadConn + SafeToRetry -> true); removed TestIsRetryablePostgresCommitError; TestIsRetryablePostgresPreCommitError unchanged. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

…config Two related changes from review discussion: 1. Pre-commit classifier is now OPT-OUT. Pre-commit, retrying is always safe (the transaction rolls back), so the cost of a false positive (retrying a permanent error we didn't blocklist) is ~200ms of wasted latency, while the cost of a false negative (NOT retrying a transient error we didn't allowlist) is a spurious user-facing failure. The former is strictly safer, so isRetryablePostgresPreCommitError now retries everything EXCEPT context.Canceled/DeadlineExceeded and the permanent SQLSTATE classes (22xxx data exceptions, 23xxx constraint violations, 42xxx syntax/schema/permission), via a new isPermanentPostgresError helper. Unknown *pgconn.PgError codes and non-PgError errors are retried. The commit classifier (isSafeRetryablePostgresError) stays OPT-IN (narrow) because at commit retrying is not always safe. isPostgresNetworkError is removed (the pre-commit catch-all now covers network errors; the commit classifier never used it). 2. MaxAttempts is no longer configurable. Added a maxRetryAttempts=3 constant matching database/sql's maxBadConnRetries+1 convention, and removed the MaxAttempts field from RetryTxOptions and RetryUnsafeOptions (plus the '<= 0 -> default to 3' logic). Services that need a different retry budget should bound the total time via the context deadline or wrap the call in their own retry loop. Added a comment on retryTx explaining the interaction with database/sql's internal driver.ErrBadConn retry (immediate, up to 3 attempts for *sql.DB methods) — our outer retry layers on top with full-jitter backoff to survive longer outages; for *sql.Tx methods there is no internal retry, so this is the only one. Tests: TestIsRetryablePostgresPreCommitError rewritten for opt-out (application error and unknown PgError code now retryable; added 22xxx/ 42xxx permanent cases); removed isPostgresNetworkError comment refs; replaced all MaxAttempts: 3 with {} in RetryTxOptions/RetryUnsafeOptions. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

The driver.ErrBadConn check in isSafeRetryablePostgresError (the commit-phase classifier) was unreachable dead code: - *sql.Tx.Commit does not retry driver.ErrBadConn internally (only *sql.DB methods have the maxBadConnRetries loop — see golang/go#11264), so there is no database/sql retry to double up with at commit time. - pgx's wrapTx.Commit returns the native pgx error directly (return wtx.tx.Commit(wtx.ctx)); the SafeToRetry -> driver.ErrBadConn conversion only happens in Conn.ExecContext/QueryContext (i.e., for fn's tx.Exec/tx.Query), NOT for Commit. So the commit classifier never sees driver.ErrBadConn with pgx. - The pre-send commit case (a SafeToRetry-flagged pgx error from commit) is already covered by the pgconn.SafeToRetry leg. Removed the leg and the now-unused database/sql/driver import from postgres.go, and documented on isSafeRetryablePostgresError why driver.ErrBadConn is not checked there. driver.ErrBadConn from fn's tx.Exec is still retryable — that's a pre-commit error handled by isRetryablePostgresPreCommitError's opt-out catch-all (TestIsRetryablePostgresPreCommitError covers it). TestIsSafeRetryablePostgresError: removed the driver.ErrBadConn -> true assertions (now false) and added a comment explaining why driver.ErrBadConn isn't in the commit classifier. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Trim verbose comments, drop general-knowledge explanations (e.g. Rollback is a no-op after Commit), remove rationale duplicated across function docs, and fix stale references to the renamed isRetryablePostgresTxError in the mysql.go/spanner.go placeholder TODOs. Behavior unchanged. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

HealthCheckPeriod drives the background goroutine that evicts connections exceeded MaxConnLifetime/MaxConnIdleTime; it does NOT ping for liveness. Liveness pinging is at acquire time (stdlib ResetSession, default ping if idle > 1s), with database/sql retrying on a fresh conn. The old comment claimed HealthCheckPeriod pings idle connections in the background and evicts dead connections before the app sees them — both wrong. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

quickdie() (the SIGQUIT handler that raises ERRCODE_CRASH_SHUTDOWN) does _exit(2) WITHOUT rolling back — the source comment says 'we don't want to try to clean up our transaction'. The rollback is via crash recovery on restart, not before the client sees 57P02. Contrast with die() (57P01 admin_shutdown), which processes the interrupt through proc_exit/ AbortTransaction and does roll back before sending. So 57P02 at commit is ambiguous: the commit may have been recorded before SIGQUIT, and quickdie skips rollback. Removed 57P02 from isSafeRetryable- PostgresError (the commit classifier) so it's wrapped with ErrCommitPhase at commit (caller decides). 57P02 is still retried pre-commit via isRetryablePostgresPreCommitError's opt-out — the tx didn't commit (process exited before committing), so retry is safe there. Tests: TestIsSafeRetryablePostgresError asserts 57P02 -> false at commit; TestIsRetryablePostgresPreCommitError asserts 57P02 -> true pre-commit (with a comment noting it's safe despite no pre-return rollback). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Commit classifier (isSafeRetryablePostgresError) now retries only errors that can NEVER mean 'possibly committed': - pgconn.SafeToRetry (pgx guarantees pre-send: COMMIT message never sent) - pgerrcode.IsTransactionRollback (class 40: server rolled back the tx before returning the ErrorResponse — 40001 serialization_failure, 40P01 deadlock_detected, 40002, 40003, 40000) Dropped 57P01 (admin_shutdown), 57P02 (crash_shutdown), 57014 (query_canceled), 57P03, 53300 from the commit classifier — each could mean 'possibly committed' (signal handlers can interrupt after the commit is recorded; quickdie skips rollback; cancel timing has an edge case; connect-time codes are irrelevant). They're still retried pre-commit via isRetryablePostgresPreCommitError's opt-out (the tx didn't commit pre-commit). At commit they're wrapped with ErrCommitPhase so the caller decides. Adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants, by the pgx author) to eliminate string literals and the inline human-friendly- name comments: removed the local postgresErrUniqueViolation/postgresErrDeadlock- Found consts; PostgresUniqueViolation/PostgresDeadlockFound use pgerrcode; isPermanentPostgresError uses pgerrcode.IsDataException/IsIntegrityConstraint- Violation/IsSyntaxErrororAccessRuleViolation (class 22/23/42); the commit classifier uses pgerrcode.IsTransactionRollback (class 40). Tests: TestIsSafeRetryablePostgresError now asserts only SafeToRetry + class 40 retryable at commit (57P01/57P02/57P03/53300/57014 false); added a TestRetryPostgresTx subtest for 57P01-at-commit not retried + wrapped with ErrCommitPhase; TestIsRetryablePostgresPreCommitError covers 57P01/57P02/57P03/ 53300/57014 still retryable pre-commit; all string literals swapped for pgerrcode constants (postgres_test.go, database_test.go too). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

…PostgresNonIdempotent Renames so the common case (selects, idempotent DML) doesn't put 'unsafe' all over callers' codebases, and the transactional variant's name signals it's for non-idempotent ops: RetryUnsafe -> RetryIdempotent RetryUnsafeOptions -> RetryIdempotentOptions RetryPostgresTx -> RetryPostgresNonIdempotent RetryMySQLTx -> RetryMySQLNonIdempotent (placeholder) RetrySpannerTx -> RetrySpannerNonIdempotent (placeholder) RetryTxOptions -> RetryNonIdempotentOptions Dropped the redundant 'Tx' suffix from the public names: the fn func(*sql.Tx) parameter makes the transaction obvious at every call site, so 'Tx' in the name added little. The internal retryTx helper keeps its name (unexported, descriptive). No behavior change. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

… + pg_xact_status RetryPostgresNonIdempotent now verifies whether an ambiguous commit landed, per r3470551946 and the PostgreSQL docs (pg_xact_status is documented for 'determine whether their transaction committed or aborted after the application and database server become disconnected while a COMMIT is in progress'). Always-on for the non-idempotent variant (no opt-in). Before each Commit, capture the transaction id: SELECT pg_current_xact_id_if_assigned()::text (NULL → read-only, no writes to duplicate). On a commit-phase error the commit classifier rejects (network, 57P01/57P02, etc.), query pg_xact_status($1::xid8) on a FRESH connection (the original may be dead; xid8 is globally unique so this works across connections/reconnects, modulo a failover replay-lag window): - 'aborted' → retry (commit didn't land). - 'committed' → ErrCommitted (don't retry — would duplicate). - NULL/'in progress'/error → ErrCommitPhase (ambiguous, caller decides). - read-only (xid nil) → retry without verifying (no writes to duplicate). Class-40 commit errors (40001/40P01, server rolled back) are retryable per the commit classifier, so they're retried without verification; on exhaustion they return the original error (not ErrCommitPhase — they're known non-commit). ErrCommitPhase is now only for INCONCLUSIVE commits. Implementation: - ErrCommitted sentinel (distinct from ErrCommitPhase). - commitStatus enum (aborted/committed/unknown). - retryDB interface (BeginTx + QueryRowContext) so retryTx can run the fresh-connection pg_xact_status query. - retryClassifier gains captureXid/verifyCommit hooks (nil for future MySQL/Spanner → ErrCommitPhase fallback, current behavior). - runOneTxAttempt captures the xid before Commit and returns it. - retryTx loop: on a classifier-rejected commit error, verify and decide retry / ErrCommitted / ErrCommitPhase; read-only → retry. - postgres.go: capturePostgresXid (::text cast sidesteps pgx's xid8 Go type-mapping) and verifyPostgresCommit (::xid8 cast), wired into postgresTxClassifier. Removed the TODO(future). Tests: mock driver implements QueryerContext, scripting the two verification queries (captureXidResult/verifyStatusResult/verifyErr). New TestRetryPostgres- NonIdempotent subtests cover verify-aborted→retry, verify-committed→ErrCommitted, verify-unknown→ErrCommitPhase, read-only→retry-without-verifying, verify-error →ErrCommitPhase, and class-40-exhaustion→original-error. TestErrCommitted added. ErrCommitPhase-exhaustion test repurposed (class-40 now returns the original, not ErrCommitPhase). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

Now that commit verification (pg_xact_status) is in place, the separate 'commit classifier' (isSafeRetryablePostgresError) is unnecessary. Commit errors are rare, and we already pay a round-trip to capture the xid before every commit — so always checking pg_xact_status on a commit error (one more round-trip, only on the rare commit error) is negligible, and it lets the SERVER be the authority on whether the commit landed instead of a client-side guess. retryTx now: pre-commit errors → preCommit classifier (opt-out, unchanged); commit errors → always verify (read-only/xid-nil → retry without verifying; aborted → retry; committed → ErrCommitted; inconclusive → ErrCommitPhase). No commit classifier, no classify method. opts.IsRetryable is additive on preCommit only (commit errors are verified, not classified — safer, since a caller shouldn't override commit verification). Removed isSafeRetryablePostgresError (and its complex exclusion doc about 57P01/57P02/57014/57P03/53300/network/driver.ErrBadConn) and TestIsSafeRetryablePostgresError. postgresTxClassifier drops the commit field. Trade-off: class-40 / SafeToRetry commit errors now go through verification (+1 pg_xact_status round-trip; previously fast-pathed). No regression for common cases — class-40 → 'aborted' → retry (same outcome); the rare SafeToRetry+failover+replay-lag case → ErrCommitPhase (recoverable). Class-40 doesn't hit the replay-lag case (the server responded, no failover). Tests updated: SafeToRetry-commit subtest now verifies (verifyStatusResult= aborted, verifyCalls==1); class-40-exhaustion subtest verifies each attempt (verifyCalls==3) and still returns the original error (not ErrCommitPhase). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

cursoragent and others added 2 commits June 29, 2026 15:49

elovelan marked this pull request as draft June 29, 2026 18:36

cursoragent and others added 11 commits June 29, 2026 18:56

elovelan changed the title ~~database: split RetryPostgres into safe RetryPostgresTx and generic RetryUnsafe~~ database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification Jun 30, 2026

elovelan marked this pull request as ready for review June 30, 2026 19:10

elovelan changed the title ~~database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification~~ database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent Jun 30, 2026

elovelan closed this Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent#2

database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent#2
elovelan wants to merge 13 commits into
stevemsmith:fix/alloydb-failover-recoveryfrom
elovelan:cursor/postgres-retry-tx-0f34

elovelan commented Jun 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

elovelan commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this adds

Pre-commit classifier (opt-out) + commit verification (always-on)

Non-configurable retry budget (maxRetryAttempts = 3)

pgerrcode adoption

Naming

Pluggable retryable cases

Out of scope (follow-ups)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

elovelan commented Jun 29, 2026 •

edited

Loading

Non-configurable retry budget (`maxRetryAttempts = 3`)

`pgerrcode` adoption