Skip to content

database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent#2

Closed
elovelan wants to merge 13 commits into
stevemsmith:fix/alloydb-failover-recoveryfrom
elovelan:cursor/postgres-retry-tx-0f34
Closed

database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent#2
elovelan wants to merge 13 commits into
stevemsmith:fix/alloydb-failover-recoveryfrom
elovelan:cursor/postgres-retry-tx-0f34

Conversation

@elovelan

@elovelan elovelan commented Jun 29, 2026

Copy link
Copy Markdown

Suggested implementation of the transaction-retry pattern from the review comments on this PR:

  • r3470551946 — recommending an explicit transaction (with fn taking a transaction argument like pgx.BeginTxFunc) instead of an implicit one, so there are fewer failure scenarios where a retry could duplicate committed work. Also: capture pg_current_xact_id() before commit and consult pg_xact_status() on failure to see if the commit was successful.
  • r3487236671 — recommending two distinct functions: one with automatic safe retries (transactional), and one for idempotent callers where the function being retried must be idempotent.
  • r3482973525 / r3483137216 — network errors and string-matched connection errors should be removed from the standalone classifier because without a transaction there's no way to know if the query landed.

What this adds

RetryPostgresNonIdempotent (safe, transactional) — wraps fn in an explicit BeginTx/Commit/Rollback and retries the whole transaction on transient errors. fn receives a *sql.Tx each attempt. This is the shape of pgx.BeginTxFunc and pgxtransactions.Transaction, and what jackc agreed should work. Use this for non-idempotent operations. Commit verification is always on (see below).

RetryIdempotent (generic, idempotent-caller) — retries fn (no transaction wrapper) on any error by default except context.Canceled/context.DeadlineExceeded. The caller vouches that fn is idempotent. Lighter-weight option for selects and idempotent DML (no transaction overhead, no commit-verification round-trip).

RetryMySQLNonIdempotent / RetrySpannerNonIdempotent (placeholders) — return errors.New("... not yet implemented") with TODOs. They share RetryNonIdempotentOptions and the fn func(*sql.Tx) error signature so the cross-backend API surface is visible.

The existing RetryPostgres is removed (this PR is unmerged). Callers should use RetryPostgresNonIdempotent (transactional) or RetryIdempotent (idempotent).

Pre-commit classifier (opt-out) + commit verification (always-on)

Pre-commit errors (from BeginTx or fn) — opt-OUT classifier isRetryablePostgresPreCommitError: pre-commit retry is always safe (the transaction rolls back), so it retries everything except context.Canceled/context.DeadlineExceeded and the permanent SQLSTATE classes (22xxx, 23xxx, 42xxx — via pgerrcode.IsDataException/IsIntegrityConstraintViolation/IsSyntaxErrororAccessRuleViolation). Unknown *pgconn.PgError codes and non-PgError errors are retried; a false positive (~200ms wasted) costs less than a false negative (spurious user-facing failure). opts.IsRetryable is additive (OR) on this classifier.

Commit errors (from (*sql.Tx).Commit) — always verified, not classified. The server is the authority on whether a commit landed (a client-side guess can't tell committed-but-response-lost from genuinely-aborted), so instead of a "retryable commit" classifier we always check pg_xact_status. Per r3470551946 and the PostgreSQL docs (which explicitly endorse pg_xact_status for "determine whether their transaction committed or aborted after the application and database server become disconnected while a COMMIT is in progress"):

  1. Before each Commit, capture the transaction id: SELECT pg_current_xact_id_if_assigned()::text (NULL → read-only; the ::text cast sidesteps pgx's xid8 Go type-mapping).
  2. On a commit error, query SELECT pg_xact_status($1::xid8) on a fresh connection (the original may be dead; xid8 is globally unique so this works across connections/reconnects, modulo a failover replay-lag window):
    • 'aborted' → retry (the commit didn't land).
    • 'committed'ErrCommitted (don't retry — would duplicate; the caller reconciles/alerts).
    • NULL / 'in progress' / query errors → ErrCommitPhase (ambiguous; caller decides) - unlikely, but could happen during an asynchronous regional failover or true regional outage.
  3. Read-only transactions (no xid) retry without verifying — no writes to duplicate.

A retryable commit error (verified-aborted or read-only) on exhaustion returns the original error (not ErrCommitPhase — it's known non-commit, just out of retries). ErrCommitPhase is only for inconclusive commits.

if errors.Is(err, database.ErrCommitted) {
    // Verified committed: do NOT retry (would duplicate). Reconcile/alert.
}

if errors.Is(err, database.ErrCommitPhase) {
    // Inconclusive: the commit may or may not have landed. Decide / re-check.
}

Both preserve the underlying commit error for inspection via errors.Is/errors.As.

Why no commit classifier? An earlier iteration had a "maximally conservative" commit classifier (isSafeRetryablePostgresError) that fast-pathed pgconn.SafeToRetry + class-40 commit errors to retry without verification. But commit errors are rare, and the xid-capture round-trip is already paid on every commit — so always checking pg_xact_status on a commit error (one more round-trip, only on the rare commit error) is negligible, and it removes the whole client-side guessing exercise (the 57P01/57P02/57014/57P03/53300/network/driver.ErrBadConn exclusion reasoning). Class-40 → pg_xact_status says 'aborted' → retry (same outcome, +1 round-trip); the rare SafeToRetry+failover+replay-lag case → ErrCommitPhase (recoverable). opts.IsRetryable no longer applies to commit errors — safer, since a caller shouldn't override commit verification.

Trade-off: commit verification adds one round-trip per transaction (pg_current_xact_id_if_assigned() before every commit), paid always but only used on rare commit errors. Always-on for the non-idempotent variant (where a duplicate-commit hurts most); RetryIdempotent skips it (no transaction).

Non-configurable retry budget (maxRetryAttempts = 3)

MaxAttempts is not configurable; both loops use maxRetryAttempts = 3 (in order to match database/sql's maxBadConnRetries+1). Use the context deadline to bound total time, or wrap for more. database/sql already retries driver.ErrBadConn for *sql.DB methods (immediate, up to 3 attempts); this outer loop layers on top with jitter for longer outages. *sql.Tx methods have no internal retry, so this is the only retry for tx.Exec/tx.Commit.

pgerrcode adoption

Adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants, by the pgx author) to eliminate string literals and inline human-friendly-name comments: PostgresUniqueViolation/PostgresDeadlockFound use pgerrcode.UniqueViolation/pgerrcode.DeadlockDetected; isPermanentPostgresError uses pgerrcode.IsDataException/IsIntegrityConstraintViolation/IsSyntaxErrororAccessRuleViolation. Tests use pgerrcode constants throughout.

Naming

  • RetryIdempotent (no transaction wrapper) — for selects and idempotent DML; the caller vouches fn is idempotent. Chosen over "unsafe" so the common case doesn't put "unsafe" all over callers' codebases.
  • RetryPostgresNonIdempotent (transactional + commit verification) — for non-idempotent ops. The Tx suffix was dropped (the fn func(*sql.Tx) error parameter makes the transaction obvious at the call site).
  • RetryIdempotentOptions / RetryNonIdempotentOptions (shared by the Postgres/MySQL/Spanner non-idempotent variants) — parallel to the function names.
  • The internal retryTx helper keeps its name (unexported, descriptive).

Pluggable retryable cases

  • RetryNonIdempotentOptions.IsRetryable is additive (OR) on the pre-commit classifier (commit errors are always verified, not classified).
  • RetryIdempotentOptions.IsRetryable replaces the default (nil → retry any error except context cancellation/deadline).

Out of scope (follow-ups)

  • Implementing RetryMySQLNonIdempotent / RetrySpannerNonIdempotent — placeholders only (no pg_xact_status equivalent wired; they'd fall back to ErrCommitPhase for commit errors until each backend provides a verifier).
  • A sql.DB.RetryPostgresNonIdempotent method on the sql/db.go wrapper — deferred (would add a sqldatabase dependency).
  • Integration test against a real Postgres for the verification path (simulating an ambiguous commit) — the unit tests use a mock driver that scripts pg_current_xact_id_if_assigned/pg_xact_status returns; a real-PG integration test would confirm the SQL and xid8 round-tripping.

Test plan

  • TestRetryIdempotent — success, retries on any error by default (incl. 23505), custom IsRetryable short-circuits, no retry on context.Canceled, ctx cancellation, attempt exhaustion.
  • TestRetryPostgresNonIdempotent — success; fn-error-then-succeed (rollback between attempts); non-retryable fn error short-circuits (23505); driver.ErrBadConn from fn retried; SafeToRetry Begin failure retried; SafeToRetry Commit failure → verified-aborted → retry; verify-aborted → retry; verify-committed → ErrCommitted; verify-unknown (NULL) → ErrCommitPhase; verify-error → ErrCommitPhase; read-only (no xid) → retry without verifying; class-40 exhaustion → verify-aborted each attempt → original error (not ErrCommitPhase); 57P01+inconclusive → ErrCommitPhase; additive opts.IsRetryable (pre-commit); ctx cancellation; attempt exhaustion. Uses a minimal database/sql/driver mock (with QueryerContext to script the verification queries).
  • TestIsRetryablePostgresPreCommitError — pre-commit classifier leg coverage.
  • TestErrCommitPhase / TestErrCommitted — sentinel detection; underlying error preserved; distinct from each other.
  • TestRetryMySQLNonIdempotentNotImplemented / TestRetrySpannerNonIdempotentNotImplemented — placeholders return "not yet implemented".
  • go build ./..., go vet ./..., go test ./database/... -short all pass.
Open in Web Open in Cursor 

cursoragent and others added 2 commits June 29, 2026 15:49
…etryUnsafe

Implements the transaction-retry pattern suggested in PR moov-io#490 review
comments (discussion_r3470551946 and discussion_r3487236671):

- RetryPostgresTx wraps fn in an explicit transaction (fn receives a
  *sql.Tx) and retries the whole transaction on transient errors. This
  is the "safe" variant: the explicit transaction guarantees fn's work
  is atomic, so a retried attempt starts from a clean slate.
- RetryUnsafe retries fn (no transaction wrapper) on any error by
  default; the caller vouches that fn is idempotent. This is the
  "unsafe" variant for cases where a transaction isn't feasible.
- RetryMySQLTx and RetrySpannerTx are added as not-implemented
  placeholders so the cross-backend API surface is visible and the gap
  is enforced by tests.

The safe variant's default classifier (isRetryablePostgresTxError) is
the OR of three signals:
  1. driver.ErrBadConn -- pgx's stdlib adapter converts
     pgconn.SafeToRetry-flagged Exec/Query errors to driver.ErrBadConn
     before fn sees them; this is the primary retry trigger inside a
     transaction.
  2. pgconn.SafeToRetry -- covers Begin/Commit failures where the
     original pgx error flows through unchanged.
  3. IsRetryablePostgresError -- server-returned SQLSTATE codes that
     guarantee rollback (40001, 40P01, 57P01, ...).

RetryTxOptions.IsRetryable is additive on top of the default classifier
so consumers can extend it with implementation-specific cases without
narrowing the safety floor. RetryUnsafeOptions.IsRetryable replaces the
default (caller takes full control).

The existing RetryPostgres is removed (the PR is unmerged). A stacked
follow-up will add pg_current_xact_id()/pg_xact_status() commit
verification; a TODO marks the insertion point in retryTx.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without
an explicit transaction there is no way to know whether a network error
occurred before or after the server accepted the query, so retrying could
duplicate committed work.

IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that
the server guarantees were rolled back before the error was returned
(40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError,
io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by
peer"/"broken pipe"/"conn closed") checks are removed. The
context.DeadlineExceeded check is removed as redundant (non-PgError errors
already return false). The doc comment explains why network errors are
excluded and points callers to RetryPostgresTx for transactional retry.

The network-error checks move into a new internal isPostgresNetworkError
helper, which is wired into isRetryablePostgresTxError as a 4th leg.
Network errors ARE safe to retry within an explicit transaction (the
server rolls back the uncommitted transaction), so RetryPostgresTx keeps
them. String matching is dropped entirely -- driver.ErrBadConn and
pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError
/ io checks cover post-send cases, so substring matching is redundant and
fragile.

Tests updated:
- TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError,
  and the string-matched errors now assert false; SQLSTATE assertions
  unchanged.
- TestIsRetryablePostgresTxError: added a Leg 4 section asserting
  io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the
  transactional classifier.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
@elovelan elovelan marked this pull request as draft June 29, 2026 18:36
cursoragent and others added 11 commits June 29, 2026 18:56
…narrow) phases

Addresses review feedback questioning the claim that commit-phase errors
always roll back: a commit-phase error is ambiguous because the commit may
have succeeded on the server before the error was returned to the client
(e.g., the TCP connection was severed after the COMMIT message was sent but
before the response was received). Retrying such an error could duplicate
the committed work.

Changes:

- IsRetryablePostgresError -> isSafeRetryablePostgresError (unexported). It
  classified only SQLSTATE codes that the server guarantees were rolled
  back, and the recommended path for callers is now RetryPostgresTx or
  RetryUnsafe, not building a custom retry loop on top of this classifier.

- isRetryablePostgresTxError -> isRetryablePostgresPreCommitError (broad,
  for errors from BeginTx or fn): driver.ErrBadConn + pgconn.SafeToRetry +
  isSafeRetryablePostgresError + isPostgresNetworkError. Any error from
  Begin or fn is safe to retry because the transaction rolls back, so
  network errors are included. Added a comment on the pgconn.SafeToRetry
  leg explaining pgx guarantees these always occur before any data is sent
  to the server.

- New isRetryablePostgresCommitError (narrow, for errors from
  (*sql.Tx).Commit): driver.ErrBadConn + pgconn.SafeToRetry +
  isSafeRetryablePostgresError. Network errors are intentionally EXCLUDED
  because they could mean the commit succeeded but the response was lost.

- New ErrCommitPhase sentinel: commit-phase errors that are not retried
  (e.g., a network error during commit) are wrapped with ErrCommitPhase so
  callers can detect them via errors.Is and decide whether to alert,
  reconcile, or check whether the commit landed (future pg_xact_status()
  work). The underlying error is preserved for inspection.

- retryTx now takes a retryClassifier (preCommit + commit funcs) and
  runOneTxAttempt returns (err, commitPhase bool); the retry loop applies
  the commit classifier when commitPhase is true and wraps the final
  commit-phase error with ErrCommitPhase. opts.IsRetryable remains
  additive on top of BOTH phases.

- TODO(stacked-pr) -> TODO(future) for the pg_current_xact_id()/
  pg_xact_status() commit-verification hook.

Tests:
- Removed TestIsRetryablePostgresError from postgres_test.go (function is
  now private); SQLSTATE coverage moved to internal TestIsSafeRetryablePostgresError.
- Renamed TestIsRetryablePostgresTxError -> TestIsRetryablePostgresPreCommitError.
- Added TestIsRetryablePostgresCommitError (network NOT retryable at commit).
- Added TestRetryPostgresTx subtests: commit-phase network error is not
  retried and is wrapped with ErrCommitPhase; commit-phase exhaustion wraps
  the final error with ErrCommitPhase.
- Added TestErrCommitPhase (sentinel detection + underlying error preserved).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
isRetryablePostgresCommitError was a strict subset of
isRetryablePostgresPreCommitError (pre-send + server-rolled-back, without
the pre-commit-only network leg), duplicating the driver.ErrBadConn and
pgconn.SafeToRetry checks that also belonged to the 'safe at any phase'
bucket. Folded those two legs into isSafeRetryablePostgresError so it
becomes the single 'safe to retry regardless of transaction phase'
classifier (pre-send + server-rolled-back SQLSTATE codes), and assigned
it directly to postgresTxClassifier.commit. isRetryablePostgresPreCommitError
now delegates: isSafeRetryablePostgresError || isPostgresNetworkError.
Behavior is unchanged; only the code structure is simpler.

Also documented on isRetryablePostgresPreCommitError the permanent-error
categories that are intentionally NOT retried pre-commit even though
retrying them would be safe (the transaction rolls back): constraint
violations (23xxx), syntax errors (42601), schema errors (42xxx),
permission errors (42501), and data exceptions (22xxx). These surface at
query time (not connection time) and would fail again on retry, so
retrying only adds ~200ms of latency before the caller receives the
error. Callers who want to retry a broader set can pass opts.IsRetryable.

Tests: folded TestIsRetryablePostgresCommitError assertions into
TestIsSafeRetryablePostgresError (added driver.ErrBadConn + SafeToRetry
-> true); removed TestIsRetryablePostgresCommitError;
TestIsRetryablePostgresPreCommitError unchanged.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…config

Two related changes from review discussion:

1. Pre-commit classifier is now OPT-OUT. Pre-commit, retrying is always
   safe (the transaction rolls back), so the cost of a false positive
   (retrying a permanent error we didn't blocklist) is ~200ms of wasted
   latency, while the cost of a false negative (NOT retrying a transient
   error we didn't allowlist) is a spurious user-facing failure. The
   former is strictly safer, so isRetryablePostgresPreCommitError now
   retries everything EXCEPT context.Canceled/DeadlineExceeded and the
   permanent SQLSTATE classes (22xxx data exceptions, 23xxx constraint
   violations, 42xxx syntax/schema/permission), via a new
   isPermanentPostgresError helper. Unknown *pgconn.PgError codes and
   non-PgError errors are retried. The commit classifier
   (isSafeRetryablePostgresError) stays OPT-IN (narrow) because at commit
   retrying is not always safe. isPostgresNetworkError is removed (the
   pre-commit catch-all now covers network errors; the commit classifier
   never used it).

2. MaxAttempts is no longer configurable. Added a maxRetryAttempts=3
   constant matching database/sql's maxBadConnRetries+1 convention, and
   removed the MaxAttempts field from RetryTxOptions and RetryUnsafeOptions
   (plus the '<= 0 -> default to 3' logic). Services that need a different
   retry budget should bound the total time via the context deadline or
   wrap the call in their own retry loop. Added a comment on retryTx
   explaining the interaction with database/sql's internal driver.ErrBadConn
   retry (immediate, up to 3 attempts for *sql.DB methods) — our outer
   retry layers on top with full-jitter backoff to survive longer outages;
   for *sql.Tx methods there is no internal retry, so this is the only one.

Tests: TestIsRetryablePostgresPreCommitError rewritten for opt-out
(application error and unknown PgError code now retryable; added 22xxx/
42xxx permanent cases); removed isPostgresNetworkError comment refs;
replaced all MaxAttempts: 3 with {} in RetryTxOptions/RetryUnsafeOptions.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
The driver.ErrBadConn check in isSafeRetryablePostgresError (the commit-phase
classifier) was unreachable dead code:

  - *sql.Tx.Commit does not retry driver.ErrBadConn internally (only *sql.DB
    methods have the maxBadConnRetries loop — see golang/go#11264), so there
    is no database/sql retry to double up with at commit time.
  - pgx's wrapTx.Commit returns the native pgx error directly
    (return wtx.tx.Commit(wtx.ctx)); the SafeToRetry -> driver.ErrBadConn
    conversion only happens in Conn.ExecContext/QueryContext (i.e., for fn's
    tx.Exec/tx.Query), NOT for Commit. So the commit classifier never sees
    driver.ErrBadConn with pgx.
  - The pre-send commit case (a SafeToRetry-flagged pgx error from commit) is
    already covered by the pgconn.SafeToRetry leg.

Removed the leg and the now-unused database/sql/driver import from postgres.go,
and documented on isSafeRetryablePostgresError why driver.ErrBadConn is not
checked there. driver.ErrBadConn from fn's tx.Exec is still retryable — that's
a pre-commit error handled by isRetryablePostgresPreCommitError's opt-out
catch-all (TestIsRetryablePostgresPreCommitError covers it).

TestIsSafeRetryablePostgresError: removed the driver.ErrBadConn -> true
assertions (now false) and added a comment explaining why driver.ErrBadConn
isn't in the commit classifier.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Trim verbose comments, drop general-knowledge explanations (e.g. Rollback is
a no-op after Commit), remove rationale duplicated across function docs, and
fix stale references to the renamed isRetryablePostgresTxError in the
mysql.go/spanner.go placeholder TODOs. Behavior unchanged.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
HealthCheckPeriod drives the background goroutine that evicts connections
exceeded MaxConnLifetime/MaxConnIdleTime; it does NOT ping for liveness.
Liveness pinging is at acquire time (stdlib ResetSession, default ping if
idle > 1s), with database/sql retrying on a fresh conn. The old comment
claimed HealthCheckPeriod pings idle connections in the background and
evicts dead connections before the app sees them — both wrong.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
quickdie() (the SIGQUIT handler that raises ERRCODE_CRASH_SHUTDOWN) does
_exit(2) WITHOUT rolling back — the source comment says 'we don't want to
try to clean up our transaction'. The rollback is via crash recovery on
restart, not before the client sees 57P02. Contrast with die() (57P01
admin_shutdown), which processes the interrupt through proc_exit/
AbortTransaction and does roll back before sending.

So 57P02 at commit is ambiguous: the commit may have been recorded before
SIGQUIT, and quickdie skips rollback. Removed 57P02 from isSafeRetryable-
PostgresError (the commit classifier) so it's wrapped with ErrCommitPhase
at commit (caller decides). 57P02 is still retried pre-commit via
isRetryablePostgresPreCommitError's opt-out — the tx didn't commit (process
exited before committing), so retry is safe there.

Tests: TestIsSafeRetryablePostgresError asserts 57P02 -> false at commit;
TestIsRetryablePostgresPreCommitError asserts 57P02 -> true pre-commit (with
a comment noting it's safe despite no pre-return rollback).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Commit classifier (isSafeRetryablePostgresError) now retries only errors that
can NEVER mean 'possibly committed':

  - pgconn.SafeToRetry (pgx guarantees pre-send: COMMIT message never sent)
  - pgerrcode.IsTransactionRollback (class 40: server rolled back the tx
    before returning the ErrorResponse — 40001 serialization_failure, 40P01
    deadlock_detected, 40002, 40003, 40000)

Dropped 57P01 (admin_shutdown), 57P02 (crash_shutdown), 57014 (query_canceled),
57P03, 53300 from the commit classifier — each could mean 'possibly committed'
(signal handlers can interrupt after the commit is recorded; quickdie skips
rollback; cancel timing has an edge case; connect-time codes are irrelevant).
They're still retried pre-commit via isRetryablePostgresPreCommitError's
opt-out (the tx didn't commit pre-commit). At commit they're wrapped with
ErrCommitPhase so the caller decides.

Adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants,
by the pgx author) to eliminate string literals and the inline human-friendly-
name comments: removed the local postgresErrUniqueViolation/postgresErrDeadlock-
Found consts; PostgresUniqueViolation/PostgresDeadlockFound use pgerrcode;
isPermanentPostgresError uses pgerrcode.IsDataException/IsIntegrityConstraint-
Violation/IsSyntaxErrororAccessRuleViolation (class 22/23/42); the commit
classifier uses pgerrcode.IsTransactionRollback (class 40).

Tests: TestIsSafeRetryablePostgresError now asserts only SafeToRetry + class 40
retryable at commit (57P01/57P02/57P03/53300/57014 false); added a
TestRetryPostgresTx subtest for 57P01-at-commit not retried + wrapped with
ErrCommitPhase; TestIsRetryablePostgresPreCommitError covers 57P01/57P02/57P03/
53300/57014 still retryable pre-commit; all string literals swapped for
pgerrcode constants (postgres_test.go, database_test.go too).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…PostgresNonIdempotent

Renames so the common case (selects, idempotent DML) doesn't put 'unsafe' all
over callers' codebases, and the transactional variant's name signals it's for
non-idempotent ops:

  RetryUnsafe             -> RetryIdempotent
  RetryUnsafeOptions      -> RetryIdempotentOptions
  RetryPostgresTx         -> RetryPostgresNonIdempotent
  RetryMySQLTx            -> RetryMySQLNonIdempotent  (placeholder)
  RetrySpannerTx          -> RetrySpannerNonIdempotent (placeholder)
  RetryTxOptions          -> RetryNonIdempotentOptions

Dropped the redundant 'Tx' suffix from the public names: the fn func(*sql.Tx)
parameter makes the transaction obvious at every call site, so 'Tx' in the
name added little. The internal retryTx helper keeps its name (unexported,
descriptive). No behavior change.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
… + pg_xact_status

RetryPostgresNonIdempotent now verifies whether an ambiguous commit landed,
per r3470551946 and the PostgreSQL docs (pg_xact_status is documented for
'determine whether their transaction committed or aborted after the
application and database server become disconnected while a COMMIT is in
progress'). Always-on for the non-idempotent variant (no opt-in).

Before each Commit, capture the transaction id:
  SELECT pg_current_xact_id_if_assigned()::text
(NULL → read-only, no writes to duplicate). On a commit-phase error the
commit classifier rejects (network, 57P01/57P02, etc.), query
pg_xact_status($1::xid8) on a FRESH connection (the original may be dead;
xid8 is globally unique so this works across connections/reconnects, modulo
a failover replay-lag window):

  - 'aborted'     → retry (commit didn't land).
  - 'committed'   → ErrCommitted (don't retry — would duplicate).
  - NULL/'in progress'/error → ErrCommitPhase (ambiguous, caller decides).
  - read-only (xid nil) → retry without verifying (no writes to duplicate).

Class-40 commit errors (40001/40P01, server rolled back) are retryable per
the commit classifier, so they're retried without verification; on exhaustion
they return the original error (not ErrCommitPhase — they're known
non-commit). ErrCommitPhase is now only for INCONCLUSIVE commits.

Implementation:
  - ErrCommitted sentinel (distinct from ErrCommitPhase).
  - commitStatus enum (aborted/committed/unknown).
  - retryDB interface (BeginTx + QueryRowContext) so retryTx can run the
    fresh-connection pg_xact_status query.
  - retryClassifier gains captureXid/verifyCommit hooks (nil for future
    MySQL/Spanner → ErrCommitPhase fallback, current behavior).
  - runOneTxAttempt captures the xid before Commit and returns it.
  - retryTx loop: on a classifier-rejected commit error, verify and decide
    retry / ErrCommitted / ErrCommitPhase; read-only → retry.
  - postgres.go: capturePostgresXid (::text cast sidesteps pgx's xid8 Go
    type-mapping) and verifyPostgresCommit (::xid8 cast), wired into
    postgresTxClassifier. Removed the TODO(future).

Tests: mock driver implements QueryerContext, scripting the two verification
queries (captureXidResult/verifyStatusResult/verifyErr). New TestRetryPostgres-
NonIdempotent subtests cover verify-aborted→retry, verify-committed→ErrCommitted,
verify-unknown→ErrCommitPhase, read-only→retry-without-verifying, verify-error
→ErrCommitPhase, and class-40-exhaustion→original-error. TestErrCommitted
added. ErrCommitPhase-exhaustion test repurposed (class-40 now returns the
original, not ErrCommitPhase).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Now that commit verification (pg_xact_status) is in place, the separate
'commit classifier' (isSafeRetryablePostgresError) is unnecessary. Commit
errors are rare, and we already pay a round-trip to capture the xid before
every commit — so always checking pg_xact_status on a commit error (one more
round-trip, only on the rare commit error) is negligible, and it lets the
SERVER be the authority on whether the commit landed instead of a client-side
guess.

retryTx now: pre-commit errors → preCommit classifier (opt-out, unchanged);
commit errors → always verify (read-only/xid-nil → retry without verifying;
aborted → retry; committed → ErrCommitted; inconclusive → ErrCommitPhase).
No commit classifier, no classify method. opts.IsRetryable is additive on
preCommit only (commit errors are verified, not classified — safer, since a
caller shouldn't override commit verification).

Removed isSafeRetryablePostgresError (and its complex exclusion doc about
57P01/57P02/57014/57P03/53300/network/driver.ErrBadConn) and
TestIsSafeRetryablePostgresError. postgresTxClassifier drops the commit field.

Trade-off: class-40 / SafeToRetry commit errors now go through verification
(+1 pg_xact_status round-trip; previously fast-pathed). No regression for
common cases — class-40 → 'aborted' → retry (same outcome); the rare
SafeToRetry+failover+replay-lag case → ErrCommitPhase (recoverable). Class-40
doesn't hit the replay-lag case (the server responded, no failover).

Tests updated: SafeToRetry-commit subtest now verifies (verifyStatusResult=
aborted, verifyCalls==1); class-40-exhaustion subtest verifies each attempt
(verifyCalls==3) and still returns the original error (not ErrCommitPhase).

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
@elovelan elovelan changed the title database: split RetryPostgres into safe RetryPostgresTx and generic RetryUnsafe database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification Jun 30, 2026
@elovelan elovelan marked this pull request as ready for review June 30, 2026 19:10
@elovelan elovelan changed the title database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent Jun 30, 2026
@elovelan elovelan closed this Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants