Skip to content

[superlog] Add queryTimeoutMs to cacheable() to prevent indefinite DB hangs#489

Open
superlog-app[bot] wants to merge 1 commit into
stagingfrom
superlog/add-query-timeout-to-cacheable
Open

[superlog] Add queryTimeoutMs to cacheable() to prevent indefinite DB hangs#489
superlog-app[bot] wants to merge 1 commit into
stagingfrom
superlog/add-query-timeout-to-cacheable

Conversation

@superlog-app

@superlog-app superlog-app Bot commented Jun 18, 2026

Copy link
Copy Markdown

Summary

The basket /batch endpoint returned after ~943 seconds (nearly 16 minutes) for a single request. The cacheable() wrapper in packages/redis has a 2-second timeout for Redis operations but no timeout for the underlying function call. When the PostgreSQL websites table lookup stalled, the request — and all concurrent requests for the same website ID deduplicated by the inflightRequests map — were frozen for the entire duration of the DB stall.

The cacheable() wrapper was extended with an optional queryTimeoutMs option. When set, the underlying fn() call is wrapped with withTimeout(..., queryTimeoutMs, "Query timeout") on all execution paths: cache miss, Redis-unavailable bypass, and background stale-while-revalidate. getWebsiteByIdWithOwnerCached in apps/basket/src/hooks/auth.ts is set to queryTimeoutMs: 10_000 (10 seconds), matching the p99 expectation for a simple PK lookup.

An alternative remediation would be to set statement_timeout on the PostgreSQL connection pool so the database itself enforces the limit — this would also protect all other Drizzle queries that don't go through cacheable. Both approaches are complementary and can be combined.

Incident on Superlog


Was this PR helpful? Leave feedback — goes straight to the Superlog team.


Summary by cubic

Add a per-call timeout to cacheable() to prevent long DB stalls from freezing requests. Applied a 10s timeout to Basket’s website lookup so requests fail fast instead of hanging.

  • New Features
    • Added queryTimeoutMs to packages/redis cacheable(), wrapping the underlying function with withTimeout(..., "Query timeout") on cache miss, Redis-bypass, and background stale-while-revalidate.
    • Cleans up inflightRequests on timeout; concurrent callers share the timed promise and receive the timeout error.
    • Tests cover timeout behavior, concurrency, and Redis-unavailable path.
    • Set queryTimeoutMs: 10_000 for getWebsiteByIdWithOwnerCached in apps/basket/src/hooks/auth.ts.

Written for commit 24c9674. Summary will update on new commits.

Review in cubic

@vercel

vercel Bot commented Jun 18, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dashboard Ready Ready Preview, Comment Jun 18, 2026 8:14pm
databuddy-status Ready Ready Preview, Comment Jun 18, 2026 8:14pm
documentation Ready Ready Preview, Comment Jun 18, 2026 8:14pm

@unkey-deploy

unkey-deploy Bot commented Jun 18, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Unkey Deploy

Name Status Preview Inspect Updated (UTC)
api (preview) Ready Visit Preview Inspect Jun 18, 2026 8:13pm

@blacksmith-sh

blacksmith-sh Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Found 7 test failures on Blacksmith runners:

Failures

Test View Logs
[chromium] › test/e2e/specs/regressions/
api-keys.spec.ts:4:5 › creates and deletes an API key without leaving confirmation dial
ogs open @regression @core
View Logs
[chromium] › test/e2e/specs/regressions/
website-analytics.spec.ts:31:5 › shows seeded analytics data and applies a topbar filte
r @regression @core
View Logs
[chromium] › test/e2e/specs/regressions/
website-analytics.spec.ts:7:5 › renders analytics controls in the website topbar @regre
ssion @core
View Logs
[chromium] › test/e2e/specs/smoke/
account.spec.ts:3:5 › updates the signed-in user's profile name @smoke
View Logs
[chromium] › test/e2e/specs/smoke/
auth.spec.ts:12:5 › boots an authenticated browser session @smoke
View Logs
[chromium] › test/e2e/specs/smoke/
auth.spec.ts:30:5 › signs out and protects authenticated routes @smoke @core
View Logs
[chromium] › test/e2e/specs/smoke/
auth.spec.ts:3:5 › redirects unauthenticated visitors to sign in @smoke
View Logs

Fix in Cursor

@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds an optional queryTimeoutMs parameter to the cacheable() wrapper to bound how long the underlying function (e.g. a DB query) is allowed to run before rejecting, directly addressing a ~943-second /batch endpoint hang caused by a PostgreSQL stall on the websites table. The timeout is threaded through all three execution paths — Redis bypass, Redis-error fallback, and the main cache-miss inflight path — and is set to 10 seconds on getWebsiteByIdWithOwnerCached.

  • withTimeout is extended with a message parameter so query timeouts produce a distinct \"Query timeout\" error separate from the existing \"Redis timeout\", keeping observability clean.
  • inflightRequests cleanup on timeout is handled correctly by the existing finally block, and concurrent callers awaiting the same inflight promise all receive the timeout rejection together (verified by the new tests).
  • The rawPromise continues executing after the JavaScript-side timeout fires (DB connection is still held); the PR description correctly identifies statement_timeout on the connection pool as a complementary, deeper fix.

Confidence Score: 4/5

The core fix is correct and safe to merge. The cacheable change is well-tested across all main execution paths, and the 10-second timeout on getWebsiteByIdWithOwnerCached directly prevents recurrence of the incident.

The resolveApiKeyOwnerId cacheable in auth.ts issues the same class of DB lookup but is left without a timeout, leaving that specific inflight-deduplication path still vulnerable to an indefinite hang. This is a gap worth closing soon, though it doesn't block this fix from shipping.

apps/basket/src/hooks/auth.ts — resolveApiKeyOwnerId should be reviewed for a queryTimeoutMs value consistent with the one added to getWebsiteByIdWithOwnerCached.

Important Files Changed

Filename Overview
packages/redis/cacheable.ts Adds optional queryTimeoutMs to CacheOptions and threads it through all three execution paths (Redis bypass, Redis error fallback, and main cache-miss inflight path). The withTimeout helper is cleanly extended with a message parameter. inflightRequests cleanup on timeout is handled correctly via the existing finally block.
packages/redis/cacheable.test.ts Adds a well-structured query timeout describe block with five cases: basic rejection, inflight cleanup + retry, concurrent-caller propagation, fast-function no-op, and Redis-unavailable bypass path. The staleWhileRevalidate background revalidation timeout path is not tested.
apps/basket/src/hooks/auth.ts Adds queryTimeoutMs: 10_000 to getWebsiteByIdWithOwnerCached — directly addressing the incident. The sibling resolveApiKeyOwnerId cacheable, which also issues a DB query, is left without a timeout.

Comments Outside Diff (1)

  1. apps/basket/src/hooks/auth.ts, line 58-67 (link)

    P2 resolveApiKeyOwnerId — same DB hang exposure as the incident, but unprotected

    resolveApiKeyOwnerId calls _resolveOwnerId, which executes db.query.member.findFirst. This is the same PostgreSQL lookup pattern that caused the ~943-second hang, but queryTimeoutMs was only added to getWebsiteByIdWithOwnerCached. If the members table stalls, all inflight callers deduplicated by inflightRequests will block indefinitely, just as the websites table did before this fix.

Reviews (1): Last reviewed commit: "[superlog] Add queryTimeoutMs to cacheab..." | Re-trigger Greptile

Comment on lines +1048 to +1068

const results = await Promise.allSettled([
cached(),
cached(),
cached(),
]);

for (const result of results) {
expect(result.status).toBe("rejected");
if (result.status === "rejected") {
expect(result.reason.message).toBe("Query timeout");
}
}
}, 5000);

it("does not time out fast functions", async () => {
const cached = cacheable(async () => "quick", {
expireInSec: 60,
prefix: "qtimeout-fast",
queryTimeoutMs: 1000,
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing test for background stale-while-revalidate timeout path

The triggerBackgroundRevalidation function now accepts queryTimeoutMs and uses withTimeout around the fn() call (line 223–225 of cacheable.ts), but the test suite has no case that exercises this path. A test with staleWhileRevalidate: true and a slow function would confirm the background fetch is bounded and that the stale value is still served to the caller (i.e. the timeout in background doesn't surface to callers).

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants