fix(seeders): cap per-seeder wall-clock time in run-seeders.sh #4342

Merged
koala73 merged 4 commits into koala73:main from guhyun9454:fix/seeder-per-seed-timeout
Jun 23, 2026

Conversation

@guhyun9454

Contributor

Problem

scripts/run-seeders.sh runs all ~149 seeders sequentially with no per-seeder time bound. The only ceiling is whatever wraps the script (a systemd timer / cron job). When a single upstream hangs, that one seeder consumes the rest of the window and every seeder alphabetically after it is silently dropped until the next trigger.

This is not hypothetical. On a self-hosted deployment running the script every 3h under a 2h systemd TimeoutStartSec, two consecutive cycles were killed mid-run at seed-trade-flows.mjs, so the entire t-z tail (ucdp-events, unrest-events, usa-spending, wb-indicators, weather-alerts, webcams, yield-curve-eu, …) went stale for hours.

Root cause measured

seed-climate-ocean-ice.mjs normally completes in ~4s:

{"event":"seed_complete","domain":"climate","recordCount":7,"durationMs":3680,...}
=== Done (3680ms) ===

…but intermittently balloons to ~60–68min (measured across multiple cycles) when its NOAA/NSIDC upstreams are slow — the node process stays alive far past the per-fetch AbortSignal budget. That single seeder is the dominant variance and the direct cause of the blown job timeout. For contrast, the slowest legitimate seeder observed (seed-gas-storage-countries.mjs) tops out around 18min of real work.

Fix

Wrap each seeder invocation in timeout(1):

  • Default SEED_TIMEOUT=1800 (30min) — comfortably above the slowest legitimate seeder (~18min), well below the pathological hangs (60min+), so it only kills runaway processes.
  • Overridable via SEED_TIMEOUT=<seconds>; SEED_TIMEOUT=0 disables it entirely.
  • -k 30 sends SIGKILL 30s after SIGTERM for seeders that ignore the term signal.
  • Timeouts are surfaced as a distinct TIMEOUT state in the per-seeder line and the final summary (Done: X ok, Y skipped, Z failed, W timed out), so they're not silently miscounted as failures.
  • Gracefully falls back to no wrapping when timeout is not on PATH.

This bounds the blast radius of any one slow/hung upstream so the rest of the run — especially the alphabetical tail — still gets its turn within the wrapping job's budget.
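
The wrapping described above could look roughly like the sketch below. It is not the PR's actual run_seed: run_one is a hypothetical helper name, and the fixtures are plain sh commands standing in for node seeders. It only illustrates the default, the 0-disables rule, and the fallback when timeout is missing.

```shell
#!/bin/sh
# Sketch only: run_one is a hypothetical helper, not the PR's run_seed.
SEED_TIMEOUT="${SEED_TIMEOUT:-1800}"   # default cap: 30 min, in seconds

run_one() {
  # Wrap only when a positive cap is set and timeout(1) is on PATH.
  # A non-numeric SEED_TIMEOUT makes the test fail, so it falls back too.
  if [ "$SEED_TIMEOUT" -gt 0 ] 2>/dev/null && command -v timeout >/dev/null 2>&1; then
    # SIGTERM at the cap, SIGKILL 30s later if the process ignores it.
    timeout -k 30 "$SEED_TIMEOUT" "$@"
  else
    "$@"
  fi
}

# Stand-in fixtures (a real call would be: run_one node "$f").
rc=0; run_one sh -c 'exit 0' || rc=$?
echo "fast fixture rc=$rc"

SEED_TIMEOUT=0   # cap disabled: the command runs unwrapped
rc=0; run_one sh -c 'exit 3' || rc=$?
echo "cap disabled rc=$rc"
```

Because timeout passes the command's own exit status through when the cap is not hit, callers can treat the wrapped and unwrapped paths identically for normal exits.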

Testing

  • sh -n scripts/run-seeders.sh — passes (script is #!/bin/sh, POSIX).
  • Verified timeout exit-code handling against fixture processes:
    • fast process → OK
    • hung process (honours SIGTERM) → terminated at the cap, exit 124 → TIMEOUT
    • process that ignores SIGTERM → SIGKILLed after the -k grace, exit 137 → TIMEOUT
    • SEED_TIMEOUT=0 → no wrapping, runs to completion
  • Confirmed real seeders still pass through unaffected under the cap (seed-climate-ocean-ice.mjs, seed-ucdp-events.mjs).
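
The two timeout cases in the fixture list can be reproduced with something like the following. This is a sketch, not the harness used for the PR: it assumes GNU coreutils timeout(1), and it shortens both the cap and the -k grace to 1s so the demo finishes in a couple of seconds.

```shell
#!/bin/sh
# Fixture sketch (assumes GNU coreutils timeout; 1s cap and 1s grace for speed).

# Hung process that honours SIGTERM: terminated at the cap.
rc_term=0
timeout -k 1 1 sleep 5 || rc_term=$?

# Process that ignores SIGTERM: SIGKILLed once the -k grace expires.
# The ignored signal disposition is inherited by the sleep child.
rc_kill=0
timeout -k 1 1 sh -c 'trap "" TERM; sleep 5' || rc_kill=$?

echo "honours SIGTERM: rc=$rc_term"   # 124 with GNU timeout
echo "ignores SIGTERM: rc=$rc_kill"   # 137 (128 + 9) with GNU timeout
```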

Shell-only change to a self-hosting helper script; no TypeScript / build surface touched.

The host-side seeders run sequentially, so a single upstream that hangs
(observed: a slow NOAA/NSIDC fetch in seed-climate-ocean-ice that keeps the
node process alive ~60min vs its normal ~4s) burns the rest of the window
and starves every seeder after it. Under a wrapping systemd/cron job
timeout the whole run is killed mid-alphabet, dropping every later seeder
until the next trigger.

Wrap each seeder in `timeout` (default 1800s, overridable via SEED_TIMEOUT,
0 to disable). The default sits well above the slowest legitimate seeder
observed (~18min) and far below the pathological hangs (60min+), so it only
kills runaway processes. Timeouts are reported as a distinct TIMEOUT state
in the per-seeder log and the final summary. Falls back to no wrapping when
`timeout` is unavailable.
@vercel

vercel Bot commented Jun 18, 2026

@guhyun9454 is attempting to deploy a commit to the World Monitor Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions github-actions Bot added the trust:safe Brin: contributor trust score safe label Jun 18, 2026
@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Contributor

Greptile Summary

Adds a per-seeder wall-clock cap to scripts/run-seeders.sh using timeout(1) with a configurable default (1800 s) and a SIGKILL grace period (-k 30), preventing a single hung seeder from consuming the entire wrapping job window and silently starving every alphabetically-later seeder.

  • run_seed() wraps each node seed-*.mjs invocation in timeout -k 30 "$SEED_TIMEOUT", falling back to unwrapped execution when timeout is not on PATH or SEED_TIMEOUT=0.
  • Exit codes 124 and 137 are surfaced as a dedicated TIMEOUT state (with a new timedout counter) distinct from FAIL in both the per-seeder line and the final summary.

Confidence Score: 4/5

The change is confined to a single shell helper with no impact on the TypeScript build surface; the core timeout-wrapping logic is correct for the target Linux/systemd environment.

The timeout logic correctly handles the primary hang scenario and the SEED_TIMEOUT=0 disable path, but the unconditional rc=137 check can misclassify an OOM-killed seeder as TIMEOUT even when the timeout wrapper was never invoked — most visibly when operators set SEED_TIMEOUT=0 and a seeder is externally killed, producing a misleading 'TIMEOUT (killed after 0s)' line.

scripts/run-seeders.sh — the rc=137 guard on line 86

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/run-seeders.sh | Adds per-seeder wall-clock timeout via timeout(1) with SIGKILL fallback; exit code 137 is checked unconditionally and would misclassify OOM-killed seeders as TIMEOUT when SEED_TIMEOUT=0 |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[for f in seed-*.mjs] --> B[run_seed f]
    B --> C{timeout available\nAND SEED_TIMEOUT > 0?}
    C -- Yes --> D["timeout -k 30 $SEED_TIMEOUT node $f"]
    C -- No --> E[node $f]
    D --> F[capture output, rc]
    E --> F
    F --> G{rc == 124\nOR rc == 137?}
    G -- Yes --> H[TIMEOUT\ntimedout++]
    G -- No --> I{last line matches\nskip pattern?}
    I -- Yes --> J[SKIP\nskip++]
    I -- No --> K{rc == 0?}
    K -- Yes --> L[OK\nok++]
    K -- No --> M[FAIL\nfail++]
    H --> A
    J --> A
    L --> A
    M --> A
    A --> N["Done: ok / skipped / failed / timed out"]
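
The decision order in the flowchart can be sketched in POSIX sh as follows. classify is a hypothetical function, and the skip pattern is a placeholder, since the script's real pattern is not shown in this thread. The sketch mirrors the flowchart as drawn, including the ungated 137 check.

```shell
#!/bin/sh
# Sketch of the flowchart's decision order; classify() and the skip
# pattern (*[Ss]kip*) are hypothetical stand-ins.
ok=0 skip=0 fail=0 timedout=0

classify() {  # $1 = exit code, $2 = last line of the seeder's output
  if [ "$1" -eq 124 ] || [ "$1" -eq 137 ]; then
    timedout=$((timedout + 1)); echo TIMEOUT
  else
    case "$2" in
      *[Ss]kip*) skip=$((skip + 1)); echo SKIP ;;
      *)
        if [ "$1" -eq 0 ]; then ok=$((ok + 1)); echo OK
        else fail=$((fail + 1)); echo FAIL
        fi ;;
    esac
  fi
}

# Called directly (not in a subshell) so the counters persist.
classify 0   "=== Done (3680ms) ==="
classify 124 ""
classify 0   "Skipping: no API key"
classify 1   "fetch failed"
echo "Done: $ok ok, $skip skipped, $fail failed, $timedout timed out"
```

The timeout check comes first so that a killed seeder is never reported as a skip or a plain failure based on whatever partial output it left behind.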

Reviews (1): Last reviewed commit: "fix(seeders): cap per-seeder wall-clock ..."

Comment thread scripts/run-seeders.sh Outdated
Comment on lines +86 to +88
if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
printf "TIMEOUT (killed after %ss)\n" "$SEED_TIMEOUT"
timedout=$((timedout + 1))

P2 rc=137 matches non-timeout SIGKILL sources too

Exit code 137 (128+9) is the generic shell representation of any SIGKILL, not only the one sent by timeout -k. The OOM killer, a container runtime enforcing a memory limit, or a manual kill -9 would also produce rc=137. The critical case is SEED_TIMEOUT=0: when timeouts are explicitly disabled, run_seed takes the else branch and invokes node directly, but the outer rc check still fires unconditionally — so an OOM-killed seeder with timeouts off would print TIMEOUT (killed after 0s) and increment timedout instead of being counted as a failure. Gate the rc=137 branch on whether the timeout wrapper was actually active.

Suggested change
if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
printf "TIMEOUT (killed after %ss)\n" "$SEED_TIMEOUT"
timedout=$((timedout + 1))
if [ "$rc" -eq 124 ] || { [ "$SEED_TIMEOUT" -gt 0 ] 2>/dev/null && [ "$rc" -eq 137 ]; }; then
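
One way to express the gating is sketched below. It is a variation on the suggestion, not the merged code: instead of testing SEED_TIMEOUT directly it takes a hypothetical wrapped flag, which would also cover the case where timeout is missing from PATH. is_timeout and report are illustrative names.

```shell
#!/bin/sh
# Sketch of the gating idea; is_timeout, report and the wrapped flag
# are hypothetical names.
is_timeout() {  # $1 = exit code, $2 = 1 if timeout(1) wrapped the run, else 0
  [ "$1" -eq 124 ] && return 0
  # 137 only counts as a timeout when the wrapper was actually in use.
  [ "$1" -eq 137 ] && [ "$2" -eq 1 ]
}

report() {
  if [ "$1" -eq 0 ]; then echo OK
  elif is_timeout "$1" "$2"; then echo TIMEOUT
  else echo "FAIL (rc=$1)"
  fi
}

report 137 1   # SIGKILL while wrapped: reported as TIMEOUT
report 137 0   # SIGKILL with the cap off (e.g. OOM kill): reported as FAIL
```

Even with the gate, a wrapped seeder that is OOM-killed before the cap still exits 137, so the gate removes the misleading case when the wrapper is off rather than every possible misclassification.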

@guhyun9454

Contributor Author

Root-cause investigation — why an external process timeout is the right layer

I traced why seed-climate-ocean-ice runs ~4s normally but balloons to ~60min, since that determines whether the fix belongs in-process or around the process. Short version: the stall lives in a phase that no in-process timeout can cancel, so an external timeout(1) (this PR) is the only reliable backstop. A complementary in-process mitigation for the one seeder is in #4343.

The seeder's own timeouts can't bound it

seed-climate-ocean-ice caps every fetch with AbortSignal.timeout (20s, 90s for NSIDC) and runSeed wraps the fetcher in withRetry (≤4 attempts). If those caps worked, the worst case would be ~6min (4 × 90s + backoff). It hit ~60min — 10× — because AbortSignal only cancels the connect/read phase, not a hung DNS lookup. Node's fetch resolves hosts via getaddrinfo, which runs on libuv's threadpool and is uncancellable.

Evidence (reproductions on the affected host)

AbortSignal bounds a socket connect-hang, but is ignored by a DNS stall:

# fetch() to a blackhole IP (no DNS), AbortSignal.timeout(3000):
[connect-hang IP 10.255.255.1] 3004ms -> TimeoutError: aborted due to timeout   # OK

# DNS resolve to a dead resolver, AbortSignal.timeout(3000):
[abort-only]  22086ms  -> ETIMEOUT     # AbortSignal IGNORED — settles on the resolver's own timeout
[no-signal]   22071ms  -> ETIMEOUT     # one dead-resolver lookup ≈ 22s

withRetry amplification is itself bounded when abort works (4 attempts × 3s timeout + 1/2/4s backoff ≈ 19s), confirming the extra 50+min comes purely from the uncancellable DNS phase, re-triggered across 4 source hosts and 4 retries:

[withRetry(3) x allSettled(4 blackhole) @3s each] 19048ms -> All upstreams failed
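
The arithmetic behind the two bounds quoted above (≈19s for the 3s reproduction, ~6min for the 90s NSIDC cap) can be checked with a few lines of sh. worst_case_s is a hypothetical helper, and applying the same 1/2/4s backoff to the 90s case is an assumption.

```shell
#!/bin/sh
# Back-of-envelope bound, assuming every attempt is cut off at its abort
# timeout and the backoff between attempts doubles from 1s (1/2/4s for 4 attempts).
worst_case_s() {  # $1 = attempts, $2 = per-attempt timeout in seconds
  total=$(( $1 * $2 )); delay=1; i=1
  while [ "$i" -lt "$1" ]; do
    total=$((total + delay)); delay=$((delay * 2)); i=$((i + 1))
  done
  echo "$total"
}

worst_case_s 4 3    # prints 19  (the reproduction above)
worst_case_s 4 90   # prints 367 (about 6 min, the NSIDC worst case)
```

Anything far beyond these figures has to come from time spent outside the abortable phase, which is the point the measurements make.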

(Host resolver here is the systemd-resolved stub at 127.0.0.53; when it flaps, every getaddrinfo for the NOAA/NSIDC/NASA hosts stalls and the 4-thread libuv pool serializes the overflow.)

Why this PR is the correct layer

Because the stall is in getaddrinfo — uncancellable from JS — AbortSignal, Promise.race, and withRetry tuning can all be defeated by it. Killing the process from outside (this PR's timeout -k) is the only guaranteed bound, which is exactly why a single hung seeder was able to consume the whole 2h window and starve the alphabetical tail. #4343 additionally races each fetch against a wall-clock deadline inside the worst offender so it self-recovers before the external cap is needed, but the external cap remains the catch-all for any seeder with the same blind spot.

@koala73 koala73 added High Value Meaningful contribution to the project Ready to Merge PR is mergeable, passes checks, and adds value labels Jun 23, 2026
@vercel

vercel Bot commented Jun 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| worldmonitor | Ready | Preview, Comment | Jun 23, 2026 6:16am |

koala73 added 2 commits June 23, 2026 10:40
The SEED_TIMEOUT=1800s (30min) default is below the legitimate runtime of bundle
seeders: run-seeders.sh's seed-*.mjs glob includes seed-bundle-*.mjs, and
_bundle-runner.mjs runs sections sequentially — resilience-recovery's Import-HHI
section alone budgets 30min (timeoutMs 1_800_000), and the bundle's sections can
sum well past that. Wrapping a bundle in the outer cap would SIGKILL it mid-run,
drop its remaining sections, and orphan the in-flight section child — re-creating
the exact tail-starvation this PR set out to prevent.

Bundles don't need the outer cap: _bundle-runner.mjs already hard-caps every
section with its own wall-clock timer (SIGTERM→SIGKILL on the child PID, immune to
the DNS-hang blind spot). Exempt seed-bundle-*.mjs from the wrapper; standalone
seeders (the actual target) keep it. Also resolve the timeout-usable check once
(was re-evaluated per seeder) and document SEED_TIMEOUT in SELF_HOSTING.md.
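
A minimal sketch of the exemption and the resolve-once check, assuming hypothetical names (wants_cap, TIMEOUT_USABLE) and a made-up bundle filename; the merged script may structure this differently:

```shell
#!/bin/sh
# Sketch only: wants_cap, TIMEOUT_USABLE and seed-bundle-example.mjs
# are hypothetical names used for illustration.
SEED_TIMEOUT="${SEED_TIMEOUT:-1800}"

# Resolve once, up front, whether the wrapper can be used at all.
TIMEOUT_USABLE=0
if [ "$SEED_TIMEOUT" -gt 0 ] 2>/dev/null && command -v timeout >/dev/null 2>&1; then
  TIMEOUT_USABLE=1
fi

wants_cap() {  # $1 = seeder filename; succeeds when the outer cap should apply
  case "$1" in
    seed-bundle-*.mjs) return 1 ;;   # bundles cap each section themselves
  esac
  [ "$TIMEOUT_USABLE" -eq 1 ]
}

for f in seed-bundle-example.mjs seed-trade-flows.mjs; do
  if wants_cap "$f"; then echo "$f: capped"; else echo "$f: uncapped"; fi
done
```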

Claude-Session: https://claude.ai/code/session_01DhVRmt455xAAoHWifVurgs
@koala73 koala73 merged commit 128340e into koala73:main Jun 23, 2026
22 checks passed

Labels

High Value Meaningful contribution to the project Ready to Merge PR is mergeable, passes checks, and adds value trust:safe Brin: contributor trust score safe

2 participants