database: add failover-safe pool defaults and pgxpool health checks for Postgres/AlloyDB#490
database: add failover-safe pool defaults and pgxpool health checks for Postgres/AlloyDB#490stevemsmith wants to merge 13 commits into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the robustness of database connections for Postgres and AlloyDB by implementing proactive health checks, establishing intelligent default connection pool settings, and providing utilities to automatically retry operations affected by transient connection issues. These changes aim to enhance service stability and recovery during database failovers or brief network disruptions, without requiring modifications to existing application code. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces robust failover handling for Postgres/AlloyDB connections by switching to pgxpool, which enables proactive health checks to evict dead connections. It also adds sensible default connection pool settings tuned for failover scenarios and an opt-in retry mechanism for transient connection errors. The changes are well-implemented, non-breaking for consumers of the library, and thoroughly tested. My feedback includes a couple of suggestions to improve code conciseness and strengthen test assertions.
|
CI is failing on a pre-existing vulnerability in |
|
Can we try this in some apps before merging? |
|
I'll work with gosec/golangci-lint on getting the linter fix released. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances PostgreSQL and AlloyDB connection management by transitioning to pgxpool with background health checks and introducing tuned default pool settings for better failover recovery. It also adds a RetryPostgres utility and an IsRetryablePostgresError helper to handle transient connection-level errors. I have no feedback to provide.
|
@stevemsmith can you rebase off the latest origin/master ? |
48cbb3b to
22a1b6f
Compare
InfernoJJ
left a comment
There was a problem hiding this comment.
Remove the retry stuff, otherwise looks good.
| return err | ||
| } | ||
| if attempt < maxAttempts-1 { | ||
| backoff := time.Duration(attempt+1) * 200 * time.Millisecond |
There was a problem hiding this comment.
Backoff is a bad idea in this case as you'll drastically increase the wait, along with it not being configurable. 200ms is an ETERNITY to a program, and is noticeable by a user. Theirs also no variance in this interface, so they will all slam at the same time.
There was a problem hiding this comment.
Got it. Pushed a commit to replace the backoff with jitter in a 0–100ms window per retry.
Calling services can control patience through the context timeout. In connections for example, the queryScanConnection method can wrap its context with a 1 second deadline before the query runs. If the retry loop hasn't gotten a working connection within that second across all 3 attempts, it stops and returns an error.
The service responds with a 500, same as today, but only after trying to recover from a dead connection rather than giving up on the first failure.
Does this address the concern? We need to retry as that's dictated by the AlloyDB patching mechanism.
…oyDB During AlloyDB maintenance switchovers (< 1s downtime), services using database/sql with pgx fail to recover because: 1. No connection pool limits are set by default, so dead connections persist indefinitely 2. No retry logic exists for transient connection errors This adds: - DefaultPostgresConnectionsConfig() with MaxLifetime=5m, MaxIdleTime=30s to ensure dead connections are evicted quickly after failover - ApplyPostgresConnectionsConfig() that fills in safe defaults when services don't explicitly configure pool settings - IsRetryablePostgresError() to classify transient PG/network errors - RetryPostgres() for services to wrap critical DB operations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from sql.Open("pgx") to pgxpool.NewWithConfig() wrapped with
stdlib.OpenDBFromPool(). This gives us pgxpool's HealthCheckPeriod
(set to 5s) which proactively pings idle connections and evicts dead
ones — the most important fix for surviving AlloyDB maintenance
switchovers. The return type remains *sql.DB so no downstream changes
are needed.
Also cleans up the dialer TODO (dialer lifecycle is now tied to the
pool) and removes the unused pgx.ParseConfig import path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For sub-second AlloyDB switchovers, 1s health checks detect and evict dead connections faster with negligible overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Collapse switch cases in IsRetryablePostgresError for readability - Make context cancellation test more specific (assert context.Canceled and exact call count) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
22a1b6f to
b56ff9e
Compare
… codes Remove the 08xxx connection_exception PgError cases — pgx never surfaces these as *pgconn.PgError, so they were dead code. Add the codes that pgx does return and where the server guarantees a rollback before returning the error: 40001 (serialization_failure), 40P01 (deadlock_detected), 53300 (too_many_connections), and 57014 (query_canceled). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace the fixed 200ms×attempt linear backoff with a random delay in [0, 100ms) per retry gap. Full jitter spreads concurrent retries across the fleet instead of having all callers wait the same fixed interval and slam the database simultaneously. Worst-case retry window across 3 attempts is ~200ms, sized for AlloyDB planned maintenance switchovers which typically cause less than one second of downtime. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Drop "connection refused" (not present in pgx's codebase) and "unexpected EOF" (already caught by errors.Is(err, io.ErrUnexpectedEOF)). Add comments explaining why the remaining string matches are safe to retry. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without an explicit transaction there is no way to know whether a network error occurred before or after the server accepted the query, so retrying could duplicate committed work. IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that the server guarantees were rolled back before the error was returned (40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError, io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by peer"/"broken pipe"/"conn closed") checks are removed. The context.DeadlineExceeded check is removed as redundant (non-PgError errors already return false). The doc comment explains why network errors are excluded and points callers to RetryPostgresTx for transactional retry. The network-error checks move into a new internal isPostgresNetworkError helper, which is wired into isRetryablePostgresTxError as a 4th leg. Network errors ARE safe to retry within an explicit transaction (the server rolls back the uncommitted transaction), so RetryPostgresTx keeps them. String matching is dropped entirely -- driver.ErrBadConn and pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError / io checks cover post-send cases, so substring matching is redundant and fragile. Tests updated: - TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError, and the string-matched errors now assert false; SQLSTATE assertions unchanged. - TestIsRetryablePostgresTxError: added a Leg 4 section asserting io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the transactional classifier. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
Yes, I'm running the changes in one service, will try a few others. |
IsRetryablePostgresError and RetryPostgres are moved to a follow-up PR where the explicit transaction / pg_xact_status approach can be designed properly. This PR focuses solely on pool health: pgxpool with HealthCheckPeriod=1s evicts dead connections within ~1s of an AlloyDB switchover so the pool is healthy before the next request arrives. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Move getPostgresConnStr back to before buildAlloyDBPoolConfig to match the pre-PR ordering and keep the diff readable. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
elovelan
left a comment
There was a problem hiding this comment.
There are a few other tuning things that Claude recommended that I want to understand better, but I wanted to submit this because these comments are the most important to address.
jwbraucher
left a comment
There was a problem hiding this comment.
reverting my approval - I agree with @elovelan that the healthcheck does not include a ping()
…on detection HealthCheckPeriod only evicts connections that have exceeded their age thresholds — it does not ping. ShouldPing fires at acquire time and sends an actual liveness check to the server before handing the connection to the caller. Ping if idle > 200ms catches dead connections from an AlloyDB switchover without adding overhead on hot connections used moments ago. Remove ApplyPostgresConnectionsConfig and DefaultPostgresConnectionsConfig: sql.DB setters have no effect on the underlying pgxpool, and SetMaxIdleConns with a non-zero value actively breaks pgxpool by preventing connections from being released back to the pool. OpenDBFromPool already sets MaxIdleConns(0) automatically. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
Removed Connection lifetime and pool size tuning are out of scope for this PR, the goal is dead connection detection, not pool rotation policy. These can be revisited in a follow-up if needed. |
…base when using pgxpool.
…patch Preserve ability to tune (some of) the connection config
Summary
Switch Postgres/AlloyDB connections from `sql.Open("pgx")` to `pgxpool` with `stdlib.OpenDBFromPool()`, and configure `ShouldPing` to detect dead connections at acquire time.
Problem
During AlloyDB maintenance switchovers, Go services fail because dead connections sit in the pool and get handed to the application. The error looks like:
```
failed to connect to `user=tumbler_sa database=tumbler`: 10.2.10.25:5432 (10.2.10.25): tls error: read tcp ...: read: connection reset by peer
```
Without this change, services continue returning 500s until the pool naturally cycles out the dead connections — which with default `database/sql` settings can take a long time.
Changes
`postgres.go`
`database.go`
`model_config.go`
Out of scope
Retry logic has been intentionally deferred to a follow-up PR. Retrying in-flight queries on connection failure requires explicit transaction handling to avoid double-writes — that deserves its own design and review.
Non-breaking
Test plan