Skip to content

database: add failover-safe pool defaults and pgxpool health checks for Postgres/AlloyDB#490

Open
stevemsmith wants to merge 13 commits into
moov-io:masterfrom
stevemsmith:fix/alloydb-failover-recovery
Open

database: add failover-safe pool defaults and pgxpool health checks for Postgres/AlloyDB#490
stevemsmith wants to merge 13 commits into
moov-io:masterfrom
stevemsmith:fix/alloydb-failover-recovery

Conversation

@stevemsmith

@stevemsmith stevemsmith commented Mar 17, 2026

Copy link
Copy Markdown

Summary

Switch Postgres/AlloyDB connections from `sql.Open("pgx")` to `pgxpool` with `stdlib.OpenDBFromPool()`, and configure `ShouldPing` to detect dead connections at acquire time.

Problem

During AlloyDB maintenance switchovers, Go services fail because dead connections sit in the pool and get handed to the application. The error looks like:

```
failed to connect to `user=tumbler_sa database=tumbler`: 10.2.10.25:5432 (10.2.10.25): tls error: read tcp ...: read: connection reset by peer
```

Without this change, services continue returning 500s until the pool naturally cycles out the dead connections — which with default `database/sql` settings can take a long time.

Changes

`postgres.go`

  • Replace `sql.Open("pgx", connStr)` with `pgxpool.NewWithConfig()` + `stdlib.OpenDBFromPool()`
  • Set `ShouldPing` to ping connections idle > 200ms before handing them to the caller. This is the mechanism that actually tests liveness at acquire time — `HealthCheckPeriod` (the background reaper) only evicts connections exceeding age thresholds and does not ping.
  • Refactor `getAlloyDBConnectorConnStr()` → `buildAlloyDBPoolConfig()` to return `*pgxpool.Config`

`database.go`

  • Remove `ApplyPostgresConnectionsConfig` — `sql.DB` setters have no effect on the underlying pgxpool, and calling `SetMaxIdleConns` with a non-zero value actively breaks pgxpool by preventing connections from being released back to the pool. `OpenDBFromPool` already sets `MaxIdleConns(0)` automatically.

`model_config.go`

  • Remove `DefaultPostgresConnectionsConfig()` — no longer needed

Out of scope

Retry logic has been intentionally deferred to a follow-up PR. Retrying in-flight queries on connection failure requires explicit transaction handling to avoid double-writes — that deserves its own design and review.

Non-breaking

  • Return type of `New()` is still `*sql.DB` — no downstream changes needed
  • MySQL and Spanner paths are untouched
  • No new external dependencies — `pgxpool` and `stdlib` are both part of `pgx/v5`

Test plan

  • `go build ./database/...` and `go vet ./database/...` pass
  • Integration test during next AlloyDB maintenance window

@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of database connections for Postgres and AlloyDB by implementing proactive health checks, establishing intelligent default connection pool settings, and providing utilities to automatically retry operations affected by transient connection issues. These changes aim to enhance service stability and recovery during database failovers or brief network disruptions, without requiring modifications to existing application code.

Highlights

  • Enhanced Postgres/AlloyDB Connection Resilience: Switched Postgres/AlloyDB connections from sql.Open("pgx") to pgxpool with stdlib.OpenDBFromPool(), enabling HealthCheckPeriod (1s) to proactively evict dead connections after a failover, while maintaining *sql.DB return type for compatibility.
  • Failover-Safe Connection Pool Defaults: Added safe connection pool defaults (MaxLifetime=5m, MaxIdleTime=30s, MaxOpen=25, MaxIdle=5) that are automatically applied when services do not explicitly configure them, preventing dead connections from lingering.
  • Transient Error Retry Utilities: Introduced IsRetryablePostgresError() to classify transient connection errors (e.g., admin_shutdown, cannot_connect_now, network errors) and RetryPostgres() for opt-in retry logic with linear backoff for critical database operations.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • database/database.go
    • Updated the New function to utilize ApplyPostgresConnectionsConfig for Postgres connections.
    • Added ApplyPostgresConnectionsConfig to apply default connection pool settings for Postgres/AlloyDB, filling in zero-valued configuration fields with predefined safe defaults.
  • database/database_test.go
    • Added TestApplyPostgresConnectionsConfig_Defaults to verify that default connection settings are correctly applied.
    • Added TestApplyPostgresConnectionsConfig_Overrides to confirm that explicitly set connection values override defaults.
    • Added TestDefaultPostgresConnectionsConfig to validate the properties of the default Postgres connection configuration.
  • database/model_config.go
    • Introduced DefaultPostgresConnectionsConfig function, providing failover-tuned default values for Postgres connection pool settings.
  • database/postgres.go
    • Refactored postgresConnection to leverage pgxpool.NewWithConfig and stdlib.OpenDBFromPool for enhanced connection management and health checking.
    • Configured pgxpool with a HealthCheckPeriod of 1 second to proactively ping and evict dead connections.
    • Renamed and refactored getAlloyDBConnectorConnStr to buildAlloyDBPoolConfig to return a *pgxpool.Config.
    • Introduced buildPgxPoolConfig to abstract the creation of pgxpool.Config for both standard Postgres and AlloyDB connections.
    • Added IsRetryablePostgresError function to identify transient PostgreSQL error codes and network errors that are safe to retry.
    • Implemented RetryPostgres utility for executing database operations with linear backoff on retryable errors.
  • database/postgres_test.go
    • Added comprehensive unit tests for IsRetryablePostgresError, covering various PostgreSQL error codes, network errors, and non-retryable scenarios.
    • Added unit tests for RetryPostgres to validate successful execution, retry behavior, non-retryable error handling, context cancellation, and attempt exhaustion.
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces robust failover handling for Postgres/AlloyDB connections by switching to pgxpool, which enables proactive health checks to evict dead connections. It also adds sensible default connection pool settings tuned for failover scenarios and an opt-in retry mechanism for transient connection errors. The changes are well-implemented, non-breaking for consumers of the library, and thoroughly tested. My feedback includes a couple of suggestions to improve code conciseness and strengthen test assertions.

Comment thread database/postgres.go Outdated
Comment thread database/postgres_test.go Outdated
adamdecaf
adamdecaf previously approved these changes Mar 17, 2026
@stevemsmith

Copy link
Copy Markdown
Author

CI is failing on a pre-existing vulnerability in golang.org/x/net@v0.50.0 (GO-2026-4559) — unrelated to the changes in this PR. Looks like #487 (Renovate) already bumps this dependency.

adamdecaf
adamdecaf previously approved these changes Mar 17, 2026
@adamdecaf

Copy link
Copy Markdown
Member

Can we try this in some apps before merging?

@adamdecaf

Copy link
Copy Markdown
Member

I'll work with gosec/golangci-lint on getting the linter fix released.

@adamdecaf adamdecaf self-requested a review March 27, 2026 19:07
@adamdecaf

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances PostgreSQL and AlloyDB connection management by transitioning to pgxpool with background health checks and introducing tuned default pool settings for better failover recovery. It also adds a RetryPostgres utility and an IsRetryablePostgresError helper to handle transient connection-level errors. I have no feedback to provide.

@adamdecaf

Copy link
Copy Markdown
Member

@stevemsmith can you rebase off the latest origin/master ?

@adamdecaf adamdecaf force-pushed the fix/alloydb-failover-recovery branch from 48cbb3b to 22a1b6f Compare March 30, 2026 16:15

@InfernoJJ InfernoJJ left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the retry stuff, otherwise looks good.

Comment thread database/postgres.go Outdated
Comment thread database/postgres.go Outdated
return err
}
if attempt < maxAttempts-1 {
backoff := time.Duration(attempt+1) * 200 * time.Millisecond

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backoff is a bad idea in this case as you'll drastically increase the wait, along with it not being configurable. 200ms is an ETERNITY to a program, and is noticeable by a user. Theirs also no variance in this interface, so they will all slam at the same time.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Pushed a commit to replace the backoff with jitter in a 0–100ms window per retry.

Calling services can control patience through the context timeout. In connections for example, the queryScanConnection method can wrap its context with a 1 second deadline before the query runs. If the retry loop hasn't gotten a working connection within that second across all 3 attempts, it stops and returns an error.

The service responds with a 500, same as today, but only after trying to recover from a dead connection rather than giving up on the first failure.

Does this address the concern? We need to retry as that's dictated by the AlloyDB patching mechanism.

Comment thread database/postgres.go Outdated
Comment thread database/postgres.go Outdated
stevemsmith and others added 4 commits June 25, 2026 12:41
…oyDB

During AlloyDB maintenance switchovers (< 1s downtime), services using
database/sql with pgx fail to recover because:
1. No connection pool limits are set by default, so dead connections
   persist indefinitely
2. No retry logic exists for transient connection errors

This adds:
- DefaultPostgresConnectionsConfig() with MaxLifetime=5m, MaxIdleTime=30s
  to ensure dead connections are evicted quickly after failover
- ApplyPostgresConnectionsConfig() that fills in safe defaults when
  services don't explicitly configure pool settings
- IsRetryablePostgresError() to classify transient PG/network errors
- RetryPostgres() for services to wrap critical DB operations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from sql.Open("pgx") to pgxpool.NewWithConfig() wrapped with
stdlib.OpenDBFromPool(). This gives us pgxpool's HealthCheckPeriod
(set to 5s) which proactively pings idle connections and evicts dead
ones — the most important fix for surviving AlloyDB maintenance
switchovers. The return type remains *sql.DB so no downstream changes
are needed.

Also cleans up the dialer TODO (dialer lifecycle is now tied to the
pool) and removes the unused pgx.ParseConfig import path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For sub-second AlloyDB switchovers, 1s health checks detect and evict
dead connections faster with negligible overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Collapse switch cases in IsRetryablePostgresError for readability
- Make context cancellation test more specific (assert context.Canceled
  and exact call count)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@stevemsmith stevemsmith force-pushed the fix/alloydb-failover-recovery branch from 22a1b6f to b56ff9e Compare June 25, 2026 18:41
… codes

Remove the 08xxx connection_exception PgError cases — pgx never surfaces
these as *pgconn.PgError, so they were dead code. Add the codes that pgx
does return and where the server guarantees a rollback before returning
the error: 40001 (serialization_failure), 40P01 (deadlock_detected),
53300 (too_many_connections), and 57014 (query_canceled).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
stevemsmith and others added 2 commits June 25, 2026 13:44
Replace the fixed 200ms×attempt linear backoff with a random delay in
[0, 100ms) per retry gap. Full jitter spreads concurrent retries across
the fleet instead of having all callers wait the same fixed interval and
slam the database simultaneously. Worst-case retry window across 3
attempts is ~200ms, sized for AlloyDB planned maintenance switchovers
which typically cause less than one second of downtime.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Drop "connection refused" (not present in pgx's codebase) and
"unexpected EOF" (already caught by errors.Is(err, io.ErrUnexpectedEOF)).
Add comments explaining why the remaining string matches are safe to retry.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@stevemsmith stevemsmith requested a review from InfernoJJ June 25, 2026 19:55
Comment thread database/postgres.go Outdated
Comment thread database/postgres.go Outdated
Comment thread database/postgres.go Outdated
cursor Bot pushed a commit to elovelan/moovio-base that referenced this pull request Jun 29, 2026
Addresses review comments r3482973525 and r3483137216 on PR moov-io#490: without
an explicit transaction there is no way to know whether a network error
occurred before or after the server accepted the query, so retrying could
duplicate committed work.

IsRetryablePostgresError now classifies only PostgreSQL SQLSTATE codes that
the server guarantees were rolled back before the error was returned
(40001, 40P01, 57P01, 57P02, 57P03, 53300, 57014). The net.OpError,
io.EOF, io.ErrUnexpectedEOF, and string-matching ("connection reset by
peer"/"broken pipe"/"conn closed") checks are removed. The
context.DeadlineExceeded check is removed as redundant (non-PgError errors
already return false). The doc comment explains why network errors are
excluded and points callers to RetryPostgresTx for transactional retry.

The network-error checks move into a new internal isPostgresNetworkError
helper, which is wired into isRetryablePostgresTxError as a 4th leg.
Network errors ARE safe to retry within an explicit transaction (the
server rolls back the uncommitted transaction), so RetryPostgresTx keeps
them. String matching is dropped entirely -- driver.ErrBadConn and
pgconn.SafeToRetry cover pgx's pre-send cases, and the typed net.OpError
/ io checks cover post-send cases, so substring matching is redundant and
fragile.

Tests updated:
- TestIsRetryablePostgresError: io.EOF, io.ErrUnexpectedEOF, net.OpError,
  and the string-matched errors now assert false; SQLSTATE assertions
  unchanged.
- TestIsRetryablePostgresTxError: added a Leg 4 section asserting
  io.EOF, io.ErrUnexpectedEOF, and net.OpError are retryable via the
  transactional classifier.

Co-authored-by: E-Love (Eric Loveland) <elovelan@users.noreply.github.com>
@stevemsmith

Copy link
Copy Markdown
Author

Can we try this in some apps before merging?

Yes, I'm running the changes in one service, will try a few others.

stevemsmith and others added 3 commits June 30, 2026 15:10
IsRetryablePostgresError and RetryPostgres are moved to a follow-up PR
where the explicit transaction / pg_xact_status approach can be designed
properly. This PR focuses solely on pool health: pgxpool with
HealthCheckPeriod=1s evicts dead connections within ~1s of an AlloyDB
switchover so the pool is healthy before the next request arrives.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Move getPostgresConnStr back to before buildAlloyDBPoolConfig to match
the pre-PR ordering and keep the diff readable.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
InfernoJJ
InfernoJJ previously approved these changes Jun 30, 2026

@elovelan elovelan left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few other tuning things that Claude recommended that I want to understand better, but I wanted to submit this because these comments are the most important to address.

Comment thread database/postgres.go Outdated
Comment thread database/database.go Outdated
Comment thread database/postgres.go

@jwbraucher jwbraucher left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverting my approval - I agree with @elovelan that the healthcheck does not include a ping()

…on detection

HealthCheckPeriod only evicts connections that have exceeded their age
thresholds — it does not ping. ShouldPing fires at acquire time and sends
an actual liveness check to the server before handing the connection to
the caller. Ping if idle > 200ms catches dead connections from an AlloyDB
switchover without adding overhead on hot connections used moments ago.

Remove ApplyPostgresConnectionsConfig and DefaultPostgresConnectionsConfig:
sql.DB setters have no effect on the underlying pgxpool, and SetMaxIdleConns
with a non-zero value actively breaks pgxpool by preventing connections from
being released back to the pool. OpenDBFromPool already sets MaxIdleConns(0)
automatically.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@stevemsmith

Copy link
Copy Markdown
Author

Removed DefaultPostgresConnectionsConfig and ApplyPostgresConnectionsConfig to keep it simpler.

Connection lifetime and pool size tuning are out of scope for this PR, the goal is dead connection detection, not pool rotation policy. These can be revisited in a follow-up if needed.

jwbraucher and others added 2 commits July 2, 2026 11:59
…patch

Preserve ability to tune (some of) the connection config

@elovelan elovelan left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants