database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent#2
Closed
elovelan wants to merge 13 commits into
Closed
Conversation
…etryUnsafe Implements the transaction-retry pattern suggested in PR moov-io#490 review comments (discussion_r3470551946 and discussion_r3487236671): - RetryPostgresTx wraps fn in an explicit transaction (fn receives a *sql.Tx) and retries the whole transaction on transient errors. This is the "safe" variant: the explicit transaction guarantees fn's work is atomic, so a retried attempt starts from a clean slate. - RetryUnsafe retries fn (no transaction wrapper) on any error by default; the caller vouches that fn is idempotent. This is the "unsafe" variant for cases where a transaction isn't feasible. - RetryMySQLTx and RetrySpannerTx are added as not-implemented placeholders so the cross-backend API surface is visible and the gap is enforced by tests. The safe variant's default classifier (isRetryablePostgresTxError) is the OR of three signals: 1. driver.ErrBadConn -- pgx's stdlib adapter converts pgconn.SafeToRetry-flagged Exec/Query errors to driver.ErrBadConn before fn sees them; this is the primary retry trigger inside a transaction. 2. pgconn.SafeToRetry -- covers Begin/Commit failures where the original pgx error flows through unchanged. 3. IsRetryablePostgresError -- server-returned SQLSTATE codes that guarantee rollback (40001, 40P01, 57P01, ...). RetryTxOptions.IsRetryable is additive on top of the default classifier so consumers can extend it with implementation-specific cases without narrowing the safety floor. RetryUnsafeOptions.IsRetryable replaces the default (caller takes full control). The existing RetryPostgres is removed (the PR is unmerged). A stacked follow-up will add pg_current_xact_id()/pg_xact_status() commit verification; a TODO marks the insertion point in retryTx. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without an explicit transaction there is no way to know whether a network error occurred before or after the server accepted the query, so retrying could duplicate committed work. IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that the server guarantees were rolled back before the error was returned (40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError, io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by peer"/"broken pipe"/"conn closed") checks are removed. The context.DeadlineExceeded check is removed as redundant (non-PgError errors already return false). The doc comment explains why network errors are excluded and points callers to RetryPostgresTx for transactional retry. The network-error checks move into a new internal isPostgresNetworkError helper, which is wired into isRetryablePostgresTxError as a 4th leg. Network errors ARE safe to retry within an explicit transaction (the server rolls back the uncommitted transaction), so RetryPostgresTx keeps them. String matching is dropped entirely -- driver.ErrBadConn and pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError / io checks cover post-send cases, so substring matching is redundant and fragile. Tests updated: - TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError, and the string-matched errors now assert false; SQLSTATE assertions unchanged. - TestIsRetryablePostgresTxError: added a Leg 4 section asserting io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the transactional classifier. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…narrow) phases Addresses review feedback questioning the claim that commit-phase errors always roll back: a commit-phase error is ambiguous because the commit may have succeeded on the server before the error was returned to the client (e.g., the TCP connection was severed after the COMMIT message was sent but before the response was received). Retrying such an error could duplicate the committed work. Changes: - IsRetryablePostgresError -> isSafeRetryablePostgresError (unexported). It classified only SQLSTATE codes that the server guarantees were rolled back, and the recommended path for callers is now RetryPostgresTx or RetryUnsafe, not building a custom retry loop on top of this classifier. - isRetryablePostgresTxError -> isRetryablePostgresPreCommitError (broad, for errors from BeginTx or fn): driver.ErrBadConn + pgconn.SafeToRetry + isSafeRetryablePostgresError + isPostgresNetworkError. Any error from Begin or fn is safe to retry because the transaction rolls back, so network errors are included. Added a comment on the pgconn.SafeToRetry leg explaining pgx guarantees these always occur before any data is sent to the server. - New isRetryablePostgresCommitError (narrow, for errors from (*sql.Tx).Commit): driver.ErrBadConn + pgconn.SafeToRetry + isSafeRetryablePostgresError. Network errors are intentionally EXCLUDED because they could mean the commit succeeded but the response was lost. - New ErrCommitPhase sentinel: commit-phase errors that are not retried (e.g., a network error during commit) are wrapped with ErrCommitPhase so callers can detect them via errors.Is and decide whether to alert, reconcile, or check whether the commit landed (future pg_xact_status() work). The underlying error is preserved for inspection. - retryTx now takes a retryClassifier (preCommit + commit funcs) and runOneTxAttempt returns (err, commitPhase bool); the retry loop applies the commit classifier when commitPhase is true and wraps the final commit-phase error with ErrCommitPhase. opts.IsRetryable remains additive on top of BOTH phases. - TODO(stacked-pr) -> TODO(future) for the pg_current_xact_id()/ pg_xact_status() commit-verification hook. Tests: - Removed TestIsRetryablePostgresError from postgres_test.go (function is now private); SQLSTATE coverage moved to internal TestIsSafeRetryablePostgresError. - Renamed TestIsRetryablePostgresTxError -> TestIsRetryablePostgresPreCommitError. - Added TestIsRetryablePostgresCommitError (network NOT retryable at commit). - Added TestRetryPostgresTx subtests: commit-phase network error is not retried and is wrapped with ErrCommitPhase; commit-phase exhaustion wraps the final error with ErrCommitPhase. - Added TestErrCommitPhase (sentinel detection + underlying error preserved). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
isRetryablePostgresCommitError was a strict subset of isRetryablePostgresPreCommitError (pre-send + server-rolled-back, without the pre-commit-only network leg), duplicating the driver.ErrBadConn and pgconn.SafeToRetry checks that also belonged to the 'safe at any phase' bucket. Folded those two legs into isSafeRetryablePostgresError so it becomes the single 'safe to retry regardless of transaction phase' classifier (pre-send + server-rolled-back SQLSTATE codes), and assigned it directly to postgresTxClassifier.commit. isRetryablePostgresPreCommitError now delegates: isSafeRetryablePostgresError || isPostgresNetworkError. Behavior is unchanged; only the code structure is simpler. Also documented on isRetryablePostgresPreCommitError the permanent-error categories that are intentionally NOT retried pre-commit even though retrying them would be safe (the transaction rolls back): constraint violations (23xxx), syntax errors (42601), schema errors (42xxx), permission errors (42501), and data exceptions (22xxx). These surface at query time (not connection time) and would fail again on retry, so retrying only adds ~200ms of latency before the caller receives the error. Callers who want to retry a broader set can pass opts.IsRetryable. Tests: folded TestIsRetryablePostgresCommitError assertions into TestIsSafeRetryablePostgresError (added driver.ErrBadConn + SafeToRetry -> true); removed TestIsRetryablePostgresCommitError; TestIsRetryablePostgresPreCommitError unchanged. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…config
Two related changes from review discussion:
1. Pre-commit classifier is now OPT-OUT. Pre-commit, retrying is always
safe (the transaction rolls back), so the cost of a false positive
(retrying a permanent error we didn't blocklist) is ~200ms of wasted
latency, while the cost of a false negative (NOT retrying a transient
error we didn't allowlist) is a spurious user-facing failure. The
former is strictly safer, so isRetryablePostgresPreCommitError now
retries everything EXCEPT context.Canceled/DeadlineExceeded and the
permanent SQLSTATE classes (22xxx data exceptions, 23xxx constraint
violations, 42xxx syntax/schema/permission), via a new
isPermanentPostgresError helper. Unknown *pgconn.PgError codes and
non-PgError errors are retried. The commit classifier
(isSafeRetryablePostgresError) stays OPT-IN (narrow) because at commit
retrying is not always safe. isPostgresNetworkError is removed (the
pre-commit catch-all now covers network errors; the commit classifier
never used it).
2. MaxAttempts is no longer configurable. Added a maxRetryAttempts=3
constant matching database/sql's maxBadConnRetries+1 convention, and
removed the MaxAttempts field from RetryTxOptions and RetryUnsafeOptions
(plus the '<= 0 -> default to 3' logic). Services that need a different
retry budget should bound the total time via the context deadline or
wrap the call in their own retry loop. Added a comment on retryTx
explaining the interaction with database/sql's internal driver.ErrBadConn
retry (immediate, up to 3 attempts for *sql.DB methods) — our outer
retry layers on top with full-jitter backoff to survive longer outages;
for *sql.Tx methods there is no internal retry, so this is the only one.
Tests: TestIsRetryablePostgresPreCommitError rewritten for opt-out
(application error and unknown PgError code now retryable; added 22xxx/
42xxx permanent cases); removed isPostgresNetworkError comment refs;
replaced all MaxAttempts: 3 with {} in RetryTxOptions/RetryUnsafeOptions.
Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
The driver.ErrBadConn check in isSafeRetryablePostgresError (the commit-phase
classifier) was unreachable dead code:
- *sql.Tx.Commit does not retry driver.ErrBadConn internally (only *sql.DB
methods have the maxBadConnRetries loop — see golang/go#11264), so there
is no database/sql retry to double up with at commit time.
- pgx's wrapTx.Commit returns the native pgx error directly
(return wtx.tx.Commit(wtx.ctx)); the SafeToRetry -> driver.ErrBadConn
conversion only happens in Conn.ExecContext/QueryContext (i.e., for fn's
tx.Exec/tx.Query), NOT for Commit. So the commit classifier never sees
driver.ErrBadConn with pgx.
- The pre-send commit case (a SafeToRetry-flagged pgx error from commit) is
already covered by the pgconn.SafeToRetry leg.
Removed the leg and the now-unused database/sql/driver import from postgres.go,
and documented on isSafeRetryablePostgresError why driver.ErrBadConn is not
checked there. driver.ErrBadConn from fn's tx.Exec is still retryable — that's
a pre-commit error handled by isRetryablePostgresPreCommitError's opt-out
catch-all (TestIsRetryablePostgresPreCommitError covers it).
TestIsSafeRetryablePostgresError: removed the driver.ErrBadConn -> true
assertions (now false) and added a comment explaining why driver.ErrBadConn
isn't in the commit classifier.
Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Trim verbose comments, drop general-knowledge explanations (e.g. Rollback is a no-op after Commit), remove rationale duplicated across function docs, and fix stale references to the renamed isRetryablePostgresTxError in the mysql.go/spanner.go placeholder TODOs. Behavior unchanged. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
HealthCheckPeriod drives the background goroutine that evicts connections exceeded MaxConnLifetime/MaxConnIdleTime; it does NOT ping for liveness. Liveness pinging is at acquire time (stdlib ResetSession, default ping if idle > 1s), with database/sql retrying on a fresh conn. The old comment claimed HealthCheckPeriod pings idle connections in the background and evicts dead connections before the app sees them — both wrong. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
quickdie() (the SIGQUIT handler that raises ERRCODE_CRASH_SHUTDOWN) does _exit(2) WITHOUT rolling back — the source comment says 'we don't want to try to clean up our transaction'. The rollback is via crash recovery on restart, not before the client sees 57P02. Contrast with die() (57P01 admin_shutdown), which processes the interrupt through proc_exit/ AbortTransaction and does roll back before sending. So 57P02 at commit is ambiguous: the commit may have been recorded before SIGQUIT, and quickdie skips rollback. Removed 57P02 from isSafeRetryable- PostgresError (the commit classifier) so it's wrapped with ErrCommitPhase at commit (caller decides). 57P02 is still retried pre-commit via isRetryablePostgresPreCommitError's opt-out — the tx didn't commit (process exited before committing), so retry is safe there. Tests: TestIsSafeRetryablePostgresError asserts 57P02 -> false at commit; TestIsRetryablePostgresPreCommitError asserts 57P02 -> true pre-commit (with a comment noting it's safe despite no pre-return rollback). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Commit classifier (isSafeRetryablePostgresError) now retries only errors that
can NEVER mean 'possibly committed':
- pgconn.SafeToRetry (pgx guarantees pre-send: COMMIT message never sent)
- pgerrcode.IsTransactionRollback (class 40: server rolled back the tx
before returning the ErrorResponse — 40001 serialization_failure, 40P01
deadlock_detected, 40002, 40003, 40000)
Dropped 57P01 (admin_shutdown), 57P02 (crash_shutdown), 57014 (query_canceled),
57P03, 53300 from the commit classifier — each could mean 'possibly committed'
(signal handlers can interrupt after the commit is recorded; quickdie skips
rollback; cancel timing has an edge case; connect-time codes are irrelevant).
They're still retried pre-commit via isRetryablePostgresPreCommitError's
opt-out (the tx didn't commit pre-commit). At commit they're wrapped with
ErrCommitPhase so the caller decides.
Adopted github.com/jackc/pgerrcode (canonical Postgres error-code constants,
by the pgx author) to eliminate string literals and the inline human-friendly-
name comments: removed the local postgresErrUniqueViolation/postgresErrDeadlock-
Found consts; PostgresUniqueViolation/PostgresDeadlockFound use pgerrcode;
isPermanentPostgresError uses pgerrcode.IsDataException/IsIntegrityConstraint-
Violation/IsSyntaxErrororAccessRuleViolation (class 22/23/42); the commit
classifier uses pgerrcode.IsTransactionRollback (class 40).
Tests: TestIsSafeRetryablePostgresError now asserts only SafeToRetry + class 40
retryable at commit (57P01/57P02/57P03/53300/57014 false); added a
TestRetryPostgresTx subtest for 57P01-at-commit not retried + wrapped with
ErrCommitPhase; TestIsRetryablePostgresPreCommitError covers 57P01/57P02/57P03/
53300/57014 still retryable pre-commit; all string literals swapped for
pgerrcode constants (postgres_test.go, database_test.go too).
Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
…PostgresNonIdempotent Renames so the common case (selects, idempotent DML) doesn't put 'unsafe' all over callers' codebases, and the transactional variant's name signals it's for non-idempotent ops: RetryUnsafe -> RetryIdempotent RetryUnsafeOptions -> RetryIdempotentOptions RetryPostgresTx -> RetryPostgresNonIdempotent RetryMySQLTx -> RetryMySQLNonIdempotent (placeholder) RetrySpannerTx -> RetrySpannerNonIdempotent (placeholder) RetryTxOptions -> RetryNonIdempotentOptions Dropped the redundant 'Tx' suffix from the public names: the fn func(*sql.Tx) parameter makes the transaction obvious at every call site, so 'Tx' in the name added little. The internal retryTx helper keeps its name (unexported, descriptive). No behavior change. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
… + pg_xact_status
RetryPostgresNonIdempotent now verifies whether an ambiguous commit landed,
per r3470551946 and the PostgreSQL docs (pg_xact_status is documented for
'determine whether their transaction committed or aborted after the
application and database server become disconnected while a COMMIT is in
progress'). Always-on for the non-idempotent variant (no opt-in).
Before each Commit, capture the transaction id:
SELECT pg_current_xact_id_if_assigned()::text
(NULL → read-only, no writes to duplicate). On a commit-phase error the
commit classifier rejects (network, 57P01/57P02, etc.), query
pg_xact_status($1::xid8) on a FRESH connection (the original may be dead;
xid8 is globally unique so this works across connections/reconnects, modulo
a failover replay-lag window):
- 'aborted' → retry (commit didn't land).
- 'committed' → ErrCommitted (don't retry — would duplicate).
- NULL/'in progress'/error → ErrCommitPhase (ambiguous, caller decides).
- read-only (xid nil) → retry without verifying (no writes to duplicate).
Class-40 commit errors (40001/40P01, server rolled back) are retryable per
the commit classifier, so they're retried without verification; on exhaustion
they return the original error (not ErrCommitPhase — they're known
non-commit). ErrCommitPhase is now only for INCONCLUSIVE commits.
Implementation:
- ErrCommitted sentinel (distinct from ErrCommitPhase).
- commitStatus enum (aborted/committed/unknown).
- retryDB interface (BeginTx + QueryRowContext) so retryTx can run the
fresh-connection pg_xact_status query.
- retryClassifier gains captureXid/verifyCommit hooks (nil for future
MySQL/Spanner → ErrCommitPhase fallback, current behavior).
- runOneTxAttempt captures the xid before Commit and returns it.
- retryTx loop: on a classifier-rejected commit error, verify and decide
retry / ErrCommitted / ErrCommitPhase; read-only → retry.
- postgres.go: capturePostgresXid (::text cast sidesteps pgx's xid8 Go
type-mapping) and verifyPostgresCommit (::xid8 cast), wired into
postgresTxClassifier. Removed the TODO(future).
Tests: mock driver implements QueryerContext, scripting the two verification
queries (captureXidResult/verifyStatusResult/verifyErr). New TestRetryPostgres-
NonIdempotent subtests cover verify-aborted→retry, verify-committed→ErrCommitted,
verify-unknown→ErrCommitPhase, read-only→retry-without-verifying, verify-error
→ErrCommitPhase, and class-40-exhaustion→original-error. TestErrCommitted
added. ErrCommitPhase-exhaustion test repurposed (class-40 now returns the
original, not ErrCommitPhase).
Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Now that commit verification (pg_xact_status) is in place, the separate 'commit classifier' (isSafeRetryablePostgresError) is unnecessary. Commit errors are rare, and we already pay a round-trip to capture the xid before every commit — so always checking pg_xact_status on a commit error (one more round-trip, only on the rare commit error) is negligible, and it lets the SERVER be the authority on whether the commit landed instead of a client-side guess. retryTx now: pre-commit errors → preCommit classifier (opt-out, unchanged); commit errors → always verify (read-only/xid-nil → retry without verifying; aborted → retry; committed → ErrCommitted; inconclusive → ErrCommitPhase). No commit classifier, no classify method. opts.IsRetryable is additive on preCommit only (commit errors are verified, not classified — safer, since a caller shouldn't override commit verification). Removed isSafeRetryablePostgresError (and its complex exclusion doc about 57P01/57P02/57014/57P03/53300/network/driver.ErrBadConn) and TestIsSafeRetryablePostgresError. postgresTxClassifier drops the commit field. Trade-off: class-40 / SafeToRetry commit errors now go through verification (+1 pg_xact_status round-trip; previously fast-pathed). No regression for common cases — class-40 → 'aborted' → retry (same outcome); the rare SafeToRetry+failover+replay-lag case → ErrCommitPhase (recoverable). Class-40 doesn't hit the replay-lag case (the server responded, no failover). Tests updated: SafeToRetry-commit subtest now verifies (verifyStatusResult= aborted, verifyCalls==1); class-40-exhaustion subtest verifies each attempt (verifyCalls==3) and still returns the original error (not ErrCommitPhase). Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Suggested implementation of the transaction-retry pattern from the review comments on this PR:
fntaking a transaction argument likepgx.BeginTxFunc) instead of an implicit one, so there are fewer failure scenarios where a retry could duplicate committed work. Also: capturepg_current_xact_id()before commit and consultpg_xact_status()on failure to see if the commit was successful.What this adds
RetryPostgresNonIdempotent(safe, transactional) — wrapsfnin an explicitBeginTx/Commit/Rollbackand retries the whole transaction on transient errors.fnreceives a*sql.Txeach attempt. This is the shape ofpgx.BeginTxFuncandpgxtransactions.Transaction, and whatjackcagreed should work. Use this for non-idempotent operations. Commit verification is always on (see below).RetryIdempotent(generic, idempotent-caller) — retriesfn(no transaction wrapper) on any error by default exceptcontext.Canceled/context.DeadlineExceeded. The caller vouches thatfnis idempotent. Lighter-weight option for selects and idempotent DML (no transaction overhead, no commit-verification round-trip).RetryMySQLNonIdempotent/RetrySpannerNonIdempotent(placeholders) — returnerrors.New("... not yet implemented")with TODOs. They shareRetryNonIdempotentOptionsand thefn func(*sql.Tx) errorsignature so the cross-backend API surface is visible.The existing
RetryPostgresis removed (this PR is unmerged). Callers should useRetryPostgresNonIdempotent(transactional) orRetryIdempotent(idempotent).Pre-commit classifier (opt-out) + commit verification (always-on)
Pre-commit errors (from
BeginTxorfn) — opt-OUT classifierisRetryablePostgresPreCommitError: pre-commit retry is always safe (the transaction rolls back), so it retries everything exceptcontext.Canceled/context.DeadlineExceededand the permanent SQLSTATE classes (22xxx,23xxx,42xxx— viapgerrcode.IsDataException/IsIntegrityConstraintViolation/IsSyntaxErrororAccessRuleViolation). Unknown*pgconn.PgErrorcodes and non-PgErrorerrors are retried; a false positive (~200ms wasted) costs less than a false negative (spurious user-facing failure).opts.IsRetryableis additive (OR) on this classifier.Commit errors (from
(*sql.Tx).Commit) — always verified, not classified. The server is the authority on whether a commit landed (a client-side guess can't tell committed-but-response-lost from genuinely-aborted), so instead of a "retryable commit" classifier we always checkpg_xact_status. Per r3470551946 and the PostgreSQL docs (which explicitly endorsepg_xact_statusfor "determine whether their transaction committed or aborted after the application and database server become disconnected while aCOMMITis in progress"):Commit, capture the transaction id:SELECT pg_current_xact_id_if_assigned()::text(NULL → read-only; the::textcast sidesteps pgx'sxid8Go type-mapping).SELECT pg_xact_status($1::xid8)on a fresh connection (the original may be dead;xid8is globally unique so this works across connections/reconnects, modulo a failover replay-lag window):'aborted'→ retry (the commit didn't land).'committed'→ErrCommitted(don't retry — would duplicate; the caller reconciles/alerts).NULL/'in progress'/ query errors →ErrCommitPhase(ambiguous; caller decides) - unlikely, but could happen during an asynchronous regional failover or true regional outage.A retryable commit error (verified-aborted or read-only) on exhaustion returns the original error (not
ErrCommitPhase— it's known non-commit, just out of retries).ErrCommitPhaseis only for inconclusive commits.Both preserve the underlying commit error for inspection via
errors.Is/errors.As.Why no commit classifier? An earlier iteration had a "maximally conservative" commit classifier (
isSafeRetryablePostgresError) that fast-pathedpgconn.SafeToRetry+ class-40 commit errors to retry without verification. But commit errors are rare, and the xid-capture round-trip is already paid on every commit — so always checkingpg_xact_statuson a commit error (one more round-trip, only on the rare commit error) is negligible, and it removes the whole client-side guessing exercise (the 57P01/57P02/57014/57P03/53300/network/driver.ErrBadConnexclusion reasoning). Class-40 →pg_xact_statussays'aborted'→ retry (same outcome, +1 round-trip); the rare SafeToRetry+failover+replay-lag case →ErrCommitPhase(recoverable).opts.IsRetryableno longer applies to commit errors — safer, since a caller shouldn't override commit verification.Trade-off: commit verification adds one round-trip per transaction (
pg_current_xact_id_if_assigned()before every commit), paid always but only used on rare commit errors. Always-on for the non-idempotent variant (where a duplicate-commit hurts most);RetryIdempotentskips it (no transaction).Non-configurable retry budget (
maxRetryAttempts = 3)MaxAttemptsis not configurable; both loops usemaxRetryAttempts = 3(in order to matchdatabase/sql'smaxBadConnRetries+1). Use the context deadline to bound total time, or wrap for more.database/sqlalready retriesdriver.ErrBadConnfor*sql.DBmethods (immediate, up to 3 attempts); this outer loop layers on top with jitter for longer outages.*sql.Txmethods have no internal retry, so this is the only retry fortx.Exec/tx.Commit.pgerrcodeadoptionAdopted
github.com/jackc/pgerrcode(canonical Postgres error-code constants, by the pgx author) to eliminate string literals and inline human-friendly-name comments:PostgresUniqueViolation/PostgresDeadlockFoundusepgerrcode.UniqueViolation/pgerrcode.DeadlockDetected;isPermanentPostgresErrorusespgerrcode.IsDataException/IsIntegrityConstraintViolation/IsSyntaxErrororAccessRuleViolation. Tests usepgerrcodeconstants throughout.Naming
RetryIdempotent(no transaction wrapper) — for selects and idempotent DML; the caller vouchesfnis idempotent. Chosen over "unsafe" so the common case doesn't put "unsafe" all over callers' codebases.RetryPostgresNonIdempotent(transactional + commit verification) — for non-idempotent ops. TheTxsuffix was dropped (thefn func(*sql.Tx) errorparameter makes the transaction obvious at the call site).RetryIdempotentOptions/RetryNonIdempotentOptions(shared by the Postgres/MySQL/Spanner non-idempotent variants) — parallel to the function names.retryTxhelper keeps its name (unexported, descriptive).Pluggable retryable cases
RetryNonIdempotentOptions.IsRetryableis additive (OR) on the pre-commit classifier (commit errors are always verified, not classified).RetryIdempotentOptions.IsRetryablereplaces the default (nil → retry any error except context cancellation/deadline).Out of scope (follow-ups)
RetryMySQLNonIdempotent/RetrySpannerNonIdempotent— placeholders only (nopg_xact_statusequivalent wired; they'd fall back toErrCommitPhasefor commit errors until each backend provides a verifier).sql.DB.RetryPostgresNonIdempotentmethod on thesql/db.gowrapper — deferred (would add asql→databasedependency).pg_current_xact_id_if_assigned/pg_xact_statusreturns; a real-PG integration test would confirm the SQL andxid8round-tripping.Test plan
TestRetryIdempotent— success, retries on any error by default (incl.23505), customIsRetryableshort-circuits, no retry oncontext.Canceled, ctx cancellation, attempt exhaustion.TestRetryPostgresNonIdempotent— success; fn-error-then-succeed (rollback between attempts); non-retryable fn error short-circuits (23505);driver.ErrBadConnfrom fn retried;SafeToRetryBegin failure retried;SafeToRetryCommit failure → verified-aborted → retry; verify-aborted → retry; verify-committed →ErrCommitted; verify-unknown (NULL) →ErrCommitPhase; verify-error →ErrCommitPhase; read-only (no xid) → retry without verifying; class-40 exhaustion → verify-aborted each attempt → original error (notErrCommitPhase);57P01+inconclusive →ErrCommitPhase; additiveopts.IsRetryable(pre-commit); ctx cancellation; attempt exhaustion. Uses a minimaldatabase/sql/drivermock (withQueryerContextto script the verification queries).TestIsRetryablePostgresPreCommitError— pre-commit classifier leg coverage.TestErrCommitPhase/TestErrCommitted— sentinel detection; underlying error preserved; distinct from each other.TestRetryMySQLNonIdempotentNotImplemented/TestRetrySpannerNonIdempotentNotImplemented— placeholders return "not yet implemented".go build ./...,go vet ./...,go test ./database/... -shortall pass.