database: add failover-safe pool defaults and pgxpool health checks for Postgres/AlloyDB by stevemsmith · Pull Request #490 · moov-io/base

stevemsmith · 2026-03-17T18:52:00Z

Summary

Switch Postgres/AlloyDB connections from `sql.Open("pgx")` to `pgxpool` with `stdlib.OpenDBFromPool()`, and configure `ShouldPing` to detect dead connections at acquire time.

Problem

During AlloyDB maintenance switchovers, Go services fail because dead connections sit in the pool and get handed to the application. The error looks like:

```
failed to connect to `user=tumbler_sa database=tumbler`: 10.2.10.25:5432 (10.2.10.25): tls error: read tcp ...: read: connection reset by peer
```

Without this change, services continue returning 500s until the pool naturally cycles out the dead connections — which with default `database/sql` settings can take a long time.

Changes

`postgres.go`

Replace `sql.Open("pgx", connStr)` with `pgxpool.NewWithConfig()` + `stdlib.OpenDBFromPool()`
Set `ShouldPing` to ping connections idle > 200ms before handing them to the caller. This is the mechanism that actually tests liveness at acquire time — `HealthCheckPeriod` (the background reaper) only evicts connections exceeding age thresholds and does not ping.
Refactor `getAlloyDBConnectorConnStr()` → `buildAlloyDBPoolConfig()` to return `*pgxpool.Config`

`database.go`

Remove `ApplyPostgresConnectionsConfig` — `sql.DB` setters have no effect on the underlying pgxpool, and calling `SetMaxIdleConns` with a non-zero value actively breaks pgxpool by preventing connections from being released back to the pool. `OpenDBFromPool` already sets `MaxIdleConns(0)` automatically.

`model_config.go`

Remove `DefaultPostgresConnectionsConfig()` — no longer needed

Out of scope

Retry logic has been intentionally deferred to a follow-up PR. Retrying in-flight queries on connection failure requires explicit transaction handling to avoid double-writes — that deserves its own design and review.

Non-breaking

Return type of `New()` is still `*sql.DB` — no downstream changes needed
MySQL and Spanner paths are untouched
No new external dependencies — `pgxpool` and `stdlib` are both part of `pgx/v5`

Test plan

`go build ./database/...` and `go vet ./database/...` pass
Integration test during next AlloyDB maintenance window

gemini-code-assist · 2026-03-17T18:52:30Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of database connections for Postgres and AlloyDB by implementing proactive health checks, establishing intelligent default connection pool settings, and providing utilities to automatically retry operations affected by transient connection issues. These changes aim to enhance service stability and recovery during database failovers or brief network disruptions, without requiring modifications to existing application code.

Highlights

Enhanced Postgres/AlloyDB Connection Resilience: Switched Postgres/AlloyDB connections from sql.Open("pgx") to pgxpool with stdlib.OpenDBFromPool(), enabling HealthCheckPeriod (1s) to proactively evict dead connections after a failover, while maintaining *sql.DB return type for compatibility.
Failover-Safe Connection Pool Defaults: Added safe connection pool defaults (MaxLifetime=5m, MaxIdleTime=30s, MaxOpen=25, MaxIdle=5) that are automatically applied when services do not explicitly configure them, preventing dead connections from lingering.
Transient Error Retry Utilities: Introduced IsRetryablePostgresError() to classify transient connection errors (e.g., admin_shutdown, cannot_connect_now, network errors) and RetryPostgres() for opt-in retry logic with linear backoff for critical database operations.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

database/database.go
- Updated the New function to utilize ApplyPostgresConnectionsConfig for Postgres connections.
- Added ApplyPostgresConnectionsConfig to apply default connection pool settings for Postgres/AlloyDB, filling in zero-valued configuration fields with predefined safe defaults.
database/database_test.go
- Added TestApplyPostgresConnectionsConfig_Defaults to verify that default connection settings are correctly applied.
- Added TestApplyPostgresConnectionsConfig_Overrides to confirm that explicitly set connection values override defaults.
- Added TestDefaultPostgresConnectionsConfig to validate the properties of the default Postgres connection configuration.
database/model_config.go
- Introduced DefaultPostgresConnectionsConfig function, providing failover-tuned default values for Postgres connection pool settings.
database/postgres.go
- Refactored postgresConnection to leverage pgxpool.NewWithConfig and stdlib.OpenDBFromPool for enhanced connection management and health checking.
- Configured pgxpool with a HealthCheckPeriod of 1 second to proactively ping and evict dead connections.
- Renamed and refactored getAlloyDBConnectorConnStr to buildAlloyDBPoolConfig to return a *pgxpool.Config.
- Introduced buildPgxPoolConfig to abstract the creation of pgxpool.Config for both standard Postgres and AlloyDB connections.
- Added IsRetryablePostgresError function to identify transient PostgreSQL error codes and network errors that are safe to retry.
- Implemented RetryPostgres utility for executing database operations with linear backoff on retryable errors.
database/postgres_test.go
- Added comprehensive unit tests for IsRetryablePostgresError, covering various PostgreSQL error codes, network errors, and non-retryable scenarios.
- Added unit tests for RetryPostgres to validate successful execution, retry behavior, non-retryable error handling, context cancellation, and attempt exhaustion.

Activity

No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces robust failover handling for Postgres/AlloyDB connections by switching to pgxpool, which enables proactive health checks to evict dead connections. It also adds sensible default connection pool settings tuned for failover scenarios and an opt-in retry mechanism for transient connection errors. The changes are well-implemented, non-breaking for consumers of the library, and thoroughly tested. My feedback includes a couple of suggestions to improve code conciseness and strengthen test assertions.

stevemsmith · 2026-03-17T19:08:46Z

CI is failing on a pre-existing vulnerability in golang.org/x/net@v0.50.0 (GO-2026-4559) — unrelated to the changes in this PR. Looks like #487 (Renovate) already bumps this dependency.

adamdecaf · 2026-03-17T19:55:31Z

Can we try this in some apps before merging?

adamdecaf · 2026-03-17T19:56:09Z

I'll work with gosec/golangci-lint on getting the linter fix released.

adamdecaf · 2026-03-27T19:07:38Z

/gemini review

gemini-code-assist

Code Review

This pull request enhances PostgreSQL and AlloyDB connection management by transitioning to pgxpool with background health checks and introducing tuned default pool settings for better failover recovery. It also adds a RetryPostgres utility and an IsRetryablePostgresError helper to handle transient connection-level errors. I have no feedback to provide.

adamdecaf · 2026-03-27T19:12:45Z

@stevemsmith can you rebase off the latest origin/master ?

InfernoJJ

Remove the retry stuff, otherwise looks good.

InfernoJJ · 2026-03-31T21:26:39Z

+			return err
+		}
+		if attempt < maxAttempts-1 {
+			backoff := time.Duration(attempt+1) * 200 * time.Millisecond


Backoff is a bad idea in this case as you'll drastically increase the wait, along with it not being configurable. 200ms is an ETERNITY to a program, and is noticeable by a user. Theirs also no variance in this interface, so they will all slam at the same time.

Got it. Pushed a commit to replace the backoff with jitter in a 0–100ms window per retry.

Calling services can control patience through the context timeout. In connections for example, the queryScanConnection method can wrap its context with a 1 second deadline before the query runs. If the retry loop hasn't gotten a working connection within that second across all 3 attempts, it stops and returns an error.

The service responds with a 500, same as today, but only after trying to recover from a dead connection rather than giving up on the first failure.

Does this address the concern? We need to retry as that's dictated by the AlloyDB patching mechanism.

…oyDB During AlloyDB maintenance switchovers (< 1s downtime), services using database/sql with pgx fail to recover because: 1. No connection pool limits are set by default, so dead connections persist indefinitely 2. No retry logic exists for transient connection errors This adds: - DefaultPostgresConnectionsConfig() with MaxLifetime=5m, MaxIdleTime=30s to ensure dead connections are evicted quickly after failover - ApplyPostgresConnectionsConfig() that fills in safe defaults when services don't explicitly configure pool settings - IsRetryablePostgresError() to classify transient PG/network errors - RetryPostgres() for services to wrap critical DB operations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Switch from sql.Open("pgx") to pgxpool.NewWithConfig() wrapped with stdlib.OpenDBFromPool(). This gives us pgxpool's HealthCheckPeriod (set to 5s) which proactively pings idle connections and evicts dead ones — the most important fix for surviving AlloyDB maintenance switchovers. The return type remains *sql.DB so no downstream changes are needed. Also cleans up the dialer TODO (dialer lifecycle is now tied to the pool) and removes the unused pgx.ParseConfig import path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

For sub-second AlloyDB switchovers, 1s health checks detect and evict dead connections faster with negligible overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Collapse switch cases in IsRetryablePostgresError for readability - Make context cancellation test more specific (assert context.Canceled and exact call count) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… codes Remove the 08xxx connection_exception PgError cases — pgx never surfaces these as *pgconn.PgError, so they were dead code. Add the codes that pgx does return and where the server guarantees a rollback before returning the error: 40001 (serialization_failure), 40P01 (deadlock_detected), 53300 (too_many_connections), and 57014 (query_canceled). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace the fixed 200ms×attempt linear backoff with a random delay in [0, 100ms) per retry gap. Full jitter spreads concurrent retries across the fleet instead of having all callers wait the same fixed interval and slam the database simultaneously. Worst-case retry window across 3 attempts is ~200ms, sized for AlloyDB planned maintenance switchovers which typically cause less than one second of downtime. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Drop "connection refused" (not present in pgx's codebase) and "unexpected EOF" (already caught by errors.Is(err, io.ErrUnexpectedEOF)). Add comments explaining why the remaining string matches are safe to retry. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without an explicit transaction there is no way to know whether a network error occurred before or after the server accepted the query, so retrying could duplicate committed work. IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that the server guarantees were rolled back before the error was returned (40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError, io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by peer"/"broken pipe"/"conn closed") checks are removed. The context.DeadlineExceeded check is removed as redundant (non-PgError errors already return false). The doc comment explains why network errors are excluded and points callers to RetryPostgresTx for transactional retry. The network-error checks move into a new internal isPostgresNetworkError helper, which is wired into isRetryablePostgresTxError as a 4th leg. Network errors ARE safe to retry within an explicit transaction (the server rolls back the uncommitted transaction), so RetryPostgresTx keeps them. String matching is dropped entirely -- driver.ErrBadConn and pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError / io checks cover post-send cases, so substring matching is redundant and fragile. Tests updated: - TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError, and the string-matched errors now assert false; SQLSTATE assertions unchanged. - TestIsRetryablePostgresTxError: added a Leg 4 section asserting io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the transactional classifier. Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>

stevemsmith · 2026-06-29T23:46:04Z

Can we try this in some apps before merging?

Yes, I'm running the changes in one service, will try a few others.

IsRetryablePostgresError and RetryPostgres are moved to a follow-up PR where the explicit transaction / pg_xact_status approach can be designed properly. This PR focuses solely on pool health: pgxpool with HealthCheckPeriod=1s evicts dead connections within ~1s of an AlloyDB switchover so the pool is healthy before the next request arrives. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Move getPostgresConnStr back to before buildAlloyDBPoolConfig to match the pre-PR ordering and keep the diff readable. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

elovelan

There are a few other tuning things that Claude recommended that I want to understand better, but I wanted to submit this because these comments are the most important to address.

jwbraucher

reverting my approval - I agree with @elovelan that the healthcheck does not include a ping()

…on detection HealthCheckPeriod only evicts connections that have exceeded their age thresholds — it does not ping. ShouldPing fires at acquire time and sends an actual liveness check to the server before handing the connection to the caller. Ping if idle > 200ms catches dead connections from an AlloyDB switchover without adding overhead on hot connections used moments ago. Remove ApplyPostgresConnectionsConfig and DefaultPostgresConnectionsConfig: sql.DB setters have no effect on the underlying pgxpool, and SetMaxIdleConns with a non-zero value actively breaks pgxpool by preventing connections from being released back to the pool. OpenDBFromPool already sets MaxIdleConns(0) automatically. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

stevemsmith · 2026-07-01T23:14:04Z

Removed DefaultPostgresConnectionsConfig and ApplyPostgresConnectionsConfig to keep it simpler.

Connection lifetime and pool size tuning are out of scope for this PR, the goal is dead connection detection, not pool rotation policy. These can be revisited in a follow-up if needed.

…base when using pgxpool.

…patch Preserve ability to tune (some of) the connection config

elovelan

looks great! 🥳

stevemsmith requested review from InfernoJJ and adamdecaf as code owners March 17, 2026 18:52

gemini-code-assist Bot reviewed Mar 17, 2026

View reviewed changes

Comment thread database/postgres.go Outdated

Comment thread database/postgres_test.go Outdated

adamdecaf previously approved these changes Mar 17, 2026

View reviewed changes

stevemsmith dismissed adamdecaf’s stale review via 48cbb3b March 17, 2026 19:10

adamdecaf previously approved these changes Mar 17, 2026

View reviewed changes

adamdecaf self-requested a review March 27, 2026 19:07

gemini-code-assist Bot reviewed Mar 27, 2026

View reviewed changes

adamdecaf force-pushed the fix/alloydb-failover-recovery branch from 48cbb3b to 22a1b6f Compare March 30, 2026 16:15

InfernoJJ requested changes Mar 31, 2026

View reviewed changes

elovelan reviewed Jun 24, 2026

View reviewed changes

Comment thread database/postgres.go Outdated

Comment thread database/postgres.go Outdated

stevemsmith and others added 4 commits June 25, 2026 12:41

database: lower HealthCheckPeriod from 5s to 1s

c3e3eda

For sub-second AlloyDB switchovers, 1s health checks detect and evict dead connections faster with negligible overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

database: address review feedback from Gemini

b56ff9e

- Collapse switch cases in IsRetryablePostgresError for readability - Make context cancellation test more specific (assert context.Canceled and exact call count) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stevemsmith force-pushed the fix/alloydb-failover-recovery branch from 22a1b6f to b56ff9e Compare June 25, 2026 18:41

stevemsmith dismissed adamdecaf’s stale review via aea2db6 June 25, 2026 18:56

stevemsmith and others added 2 commits June 25, 2026 13:44

stevemsmith requested a review from InfernoJJ June 25, 2026 19:55

elovelan reviewed Jun 28, 2026

View reviewed changes

Comment thread database/postgres.go Outdated

Comment thread database/postgres.go Outdated

Comment thread database/postgres.go Outdated

This was referenced Jun 29, 2026

database: split RetryPostgres into safe RetryPostgresTx and generic RetryUnsafe elovelan/moovio-base#1

Draft

database: split RetryPostgres into RetryIdempotent and RetryPostgresNonIdempotent stevemsmith/base#2

Closed

elovelan mentioned this pull request Jun 30, 2026

database: Postgres/AlloyDB network reliability — failover-safe pool + transactional retry with commit verification elovelan/moovio-base#2

Open

7 tasks

stevemsmith and others added 3 commits June 30, 2026 15:10

database: restore original position of getPostgresConnStr

4e5e7c5

Move getPostgresConnStr back to before buildAlloyDBPoolConfig to match the pre-PR ordering and keep the diff readable. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

database: remove double blank line leftover from retry test removal

4f89e83

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

InfernoJJ previously approved these changes Jun 30, 2026

View reviewed changes

jwbraucher approved these changes Jul 1, 2026

View reviewed changes

elovelan reviewed Jul 1, 2026

View reviewed changes

Comment thread database/postgres.go Outdated

Comment thread database/database.go Outdated

Comment thread database/postgres.go

jwbraucher reviewed Jul 1, 2026

View reviewed changes

stevemsmith dismissed InfernoJJ’s stale review via 596686c July 1, 2026 23:07

jwbraucher and others added 2 commits July 2, 2026 11:59

Preserve ability to tune (some of) the connection config for the data…

f6d8d4c

…base when using pgxpool.

Merge pull request #3 from jwbraucher/fix-alloy-db-failover-recovery-…

cc6dca8

…patch Preserve ability to tune (some of) the connection config

jwbraucher approved these changes Jul 2, 2026

View reviewed changes

elovelan approved these changes Jul 2, 2026

View reviewed changes

Uh oh!

Conversation

stevemsmith commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Changes

`postgres.go`

`database.go`

`model_config.go`

Out of scope

Non-breaking

Test plan

Uh oh!

gemini-code-assist Bot commented Mar 17, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

stevemsmith commented Mar 17, 2026

Uh oh!

adamdecaf commented Mar 17, 2026

Uh oh!

adamdecaf commented Mar 17, 2026

Uh oh!

adamdecaf commented Mar 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

adamdecaf commented Mar 27, 2026

Uh oh!

InfernoJJ left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

InfernoJJ Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

stevemsmith Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

stevemsmith commented Jun 29, 2026

Uh oh!

elovelan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jwbraucher left a comment

Choose a reason for hiding this comment

Uh oh!

stevemsmith commented Jul 1, 2026

Uh oh!

elovelan left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

stevemsmith commented Mar 17, 2026 •

edited

Loading