fix(seeders): cap per-seeder wall-clock time in run-seeders.sh by guhyun9454 · Pull Request #4342 · koala73/worldmonitor

guhyun9454 · 2026-06-18T14:39:59Z

Problem

scripts/run-seeders.sh runs all ~149 seeders sequentially with no per-seeder time bound. The only ceiling is whatever wraps the script (a systemd timer / cron job). When a single upstream hangs, that one seeder consumes the rest of the window and every seeder alphabetically after it is silently dropped until the next trigger.

This is not hypothetical. On a self-hosted deployment running the script every 3h under a 2h systemd TimeoutStartSec, two consecutive cycles were killed mid-run at seed-trade-flows.mjs, so the entire t–z tail (ucdp-events, unrest-events, usa-spending, wb-indicators, weather-alerts, webcams, yield-curve-eu, …) went stale for hours.

Root cause measured

seed-climate-ocean-ice.mjs normally completes in ~4s:

{"event":"seed_complete","domain":"climate","recordCount":7,"durationMs":3680,...}
=== Done (3680ms) ===

…but intermittently balloons to ~60–68min (measured across multiple cycles) when its NOAA/NSIDC upstreams are slow — the node process stays alive far past the per-fetch AbortSignal budget. That single seeder is the dominant variance and the direct cause of the blown job timeout. For contrast, the slowest legitimate seeder observed (seed-gas-storage-countries.mjs) tops out around 18min of real work.

Fix

Wrap each seeder invocation in timeout(1):

- Default SEED_TIMEOUT=1800 (30min): comfortably above the slowest legitimate seeder (~18min) and well below the pathological hangs (60min+), so it only kills runaway processes.
- Overridable via SEED_TIMEOUT=<seconds>; SEED_TIMEOUT=0 disables it entirely.
- -k 30 sends SIGKILL 30s after SIGTERM for seeders that ignore the term signal.
- Timeouts are surfaced as a distinct TIMEOUT state in the per-seeder line and the final summary (Done: X ok, Y skipped, Z failed, W timed out), so they're not silently miscounted as failures.
- Gracefully falls back to no wrapping when timeout is not on PATH.
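The wrapping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the merged script: SEED_TIMEOUT and the -k 30 grace period come from the description, while the helper name run_capped and the KILL_GRACE variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: SEED_TIMEOUT and -k 30 come from the PR description;
# run_capped and KILL_GRACE are hypothetical names, not the merged script.
SEED_TIMEOUT="${SEED_TIMEOUT:-1800}"   # seconds; 0 disables the cap
KILL_GRACE=30                          # seconds between SIGTERM and SIGKILL

# run_capped CMD [ARGS...]: run a command under the cap when possible.
run_capped() {
  if [ "$SEED_TIMEOUT" -gt 0 ] 2>/dev/null && command -v timeout >/dev/null 2>&1; then
    timeout -k "$KILL_GRACE" "$SEED_TIMEOUT" "$@"
  else
    "$@"   # fallback: cap disabled, or timeout not on PATH
  fi
}

# Example: a command that finishes inside the cap keeps its own exit status.
run_capped sh -c 'exit 0'
echo "rc=$?"
```

In the real script the command would be the node invocation of each seed-*.mjs file; a trivial command is used here so the sketch runs anywhere.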

This bounds the blast radius of any one slow/hung upstream so the rest of the run — especially the alphabetical tail — still gets its turn within the wrapping job's budget.

Testing

sh -n scripts/run-seeders.sh — passes (script is #!/bin/sh, POSIX).
Verified timeout exit-code handling against fixture processes:
- fast process → OK
- hung process (honours SIGTERM) → terminated at the cap, exit 124 → TIMEOUT
- process that ignores SIGTERM → SIGKILLed after the -k grace, exit 137 → TIMEOUT
- SEED_TIMEOUT=0 → no wrapping, runs to completion
Confirmed real seeders still pass through unaffected under the cap (seed-climate-ocean-ice.mjs, seed-ucdp-events.mjs).
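The fixture checks above could be reproduced with something like the sketch below. It assumes GNU coreutils timeout and uses a 1s cap and 1s grace period (illustrative values, not the script's defaults) so the whole run takes only a few seconds.

```shell
#!/bin/sh
# Fixture sketch with a short cap so it finishes in a few seconds.
# CAP and GRACE are illustrative values, not the script's defaults.
CAP=1
GRACE=1

# Fast process: exits before the cap, so its own status is returned.
timeout -k "$GRACE" "$CAP" true
rc_fast=$?

# Hung process that honours SIGTERM: terminated at the cap.
timeout -k "$GRACE" "$CAP" sleep 10
rc_term=$?

# Process that ignores SIGTERM: the ignored disposition is inherited by
# sleep, so only the follow-up SIGKILL ends it.
timeout -k "$GRACE" "$CAP" sh -c 'trap "" TERM; sleep 10'
rc_kill=$?

echo "fast=$rc_fast term=$rc_term kill=$rc_kill"
```

With GNU timeout the three statuses should line up with the cases listed above (0, 124, 137); other timeout implementations may behave differently.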

Shell-only change to a self-hosting helper script; no TypeScript / build surface touched.

The host-side seeders run sequentially, so a single upstream that hangs (observed: a slow NOAA/NSIDC fetch in seed-climate-ocean-ice that keeps the node process alive ~60min vs its normal ~4s) burns the rest of the window and starves every seeder after it. Under a wrapping systemd/cron job timeout the whole run is killed mid-alphabet, dropping every later seeder until the next trigger. Wrap each seeder in `timeout` (default 1800s, overridable via SEED_TIMEOUT, 0 to disable). The default sits well above the slowest legitimate seeder observed (~18min) and far below the pathological hangs (60min+), so it only kills runaway processes. timeouts are reported as a distinct TIMEOUT state in the per-seeder log and the final summary. Falls back to no wrapping when `timeout` is unavailable.

vercel · 2026-06-18T14:40:03Z

@guhyun9454 is attempting to deploy a commit to the World Monitor Team on Vercel.

A member of the Team first needs to authorize it.

greptile-apps · 2026-06-18T14:47:07Z

Greptile Summary

Adds a per-seeder wall-clock cap to scripts/run-seeders.sh using timeout(1) with a configurable default (1800 s) and a SIGKILL grace period (-k 30), preventing a single hung seeder from consuming the entire wrapping job window and silently starving every alphabetically-later seeder.

run_seed() wraps each node seed-*.mjs invocation in timeout -k 30 "$SEED_TIMEOUT", falling back to unwrapped execution when timeout is not on PATH or SEED_TIMEOUT=0.
Exit codes 124 and 137 are surfaced as a dedicated TIMEOUT state (with a new timedout counter) distinct from FAIL in both the per-seeder line and the final summary.

Confidence Score: 4/5

The change is confined to a single shell helper with no impact on the TypeScript build surface; the core timeout-wrapping logic is correct for the target Linux/systemd environment.

The timeout logic correctly handles the primary hang scenario and the SEED_TIMEOUT=0 disable path, but the unconditional rc=137 check can misclassify an OOM-killed seeder as TIMEOUT even when the timeout wrapper was never invoked — most visibly when operators set SEED_TIMEOUT=0 and a seeder is externally killed, producing a misleading 'TIMEOUT (killed after 0s)' line.

scripts/run-seeders.sh — the rc=137 guard on line 86

Important Files Changed

Filename	Overview
scripts/run-seeders.sh	Adds per-seeder wall-clock timeout via timeout(1) with SIGKILL fallback; exit code 137 is gated unconditionally and would misclassify OOM-killed seeders as TIMEOUT when SEED_TIMEOUT=0

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[for f in seed-*.mjs] --> B[run_seed f]
    B --> C{timeout available\nAND SEED_TIMEOUT > 0?}
    C -- Yes --> D["timeout -k 30 $SEED_TIMEOUT node $f"]
    C -- No --> E[node $f]
    D --> F[capture output, rc]
    E --> F
    F --> G{rc == 124\nOR rc == 137?}
    G -- Yes --> H[TIMEOUT\ntimedout++]
    G -- No --> I{last line matches\nskip pattern?}
    I -- Yes --> J[SKIP\nskip++]
    I -- No --> K{rc == 0?}
    K -- Yes --> L[OK\nok++]
    K -- No --> M[FAIL\nfail++]
    H --> A
    J --> A
    L --> A
    M --> A
    A --> N["Done: ok / skipped / failed / timed out"]

Reviews (1): Last reviewed commit: "fix(seeders): cap per-seeder wall-clock ..."

greptile-apps · 2026-06-18T14:47:11Z

+  if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
+    printf "TIMEOUT (killed after %ss)\n" "$SEED_TIMEOUT"
+    timedout=$((timedout + 1))


rc=137 matches non-timeout SIGKILL sources too

Exit code 137 (128+9) is the generic shell representation of any SIGKILL, not only the one sent by timeout -k. The OOM killer, a container runtime enforcing a memory limit, or a manual kill -9 would also produce rc=137. The critical case is SEED_TIMEOUT=0: when timeouts are explicitly disabled, run_seed takes the else branch and invokes node directly, but the outer rc check still fires unconditionally — so an OOM-killed seeder with timeouts off would print TIMEOUT (killed after 0s) and increment timedout instead of being counted as a failure. Gate the rc=137 branch on whether the timeout wrapper was actually active.

Suggested change

-  if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
+  if [ "$rc" -eq 124 ] || { [ "$SEED_TIMEOUT" -gt 0 ] 2>/dev/null && [ "$rc" -eq 137 ]; }; then
     printf "TIMEOUT (killed after %ss)\n" "$SEED_TIMEOUT"
     timedout=$((timedout + 1))
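One way to picture the gating this review asks for is a small classifier that takes both the exit code and a flag saying whether the wrapper was actually used. The helper classify and its wrapped argument are hypothetical, and the skip detection from the real script is left out; this is a sketch of the idea rather than the change that was merged.

```shell
#!/bin/sh
# classify RC WRAPPED: print the state for one seeder run.
# WRAPPED is 1 when the timeout wrapper was actually used, else 0.
# Hypothetical helper; skip detection from the real script is omitted.
classify() {
  rc=$1
  wrapped=$2
  if [ "$wrapped" -eq 1 ] && { [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; }; then
    echo TIMEOUT
  elif [ "$rc" -eq 0 ]; then
    echo OK
  else
    echo FAIL
  fi
}

classify 137 1   # wrapper active, SIGKILL after the grace period -> TIMEOUT
classify 137 0   # wrapper inactive, e.g. OOM kill -> FAIL
classify 124 1   # wrapper active, terminated at the cap -> TIMEOUT
classify 0 1     # normal completion -> OK
```

Even with the wrapper active, a 137 can still come from an external SIGKILL, so this narrows the misclassification rather than eliminating it.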

guhyun9454 · 2026-06-18T15:23:22Z

Root-cause investigation — why an external process timeout is the right layer

I traced why seed-climate-ocean-ice runs ~4s normally but balloons to ~60min, since that determines whether the fix belongs in-process or around the process. Short version: the stall lives in a phase that no in-process timeout can cancel, so an external timeout(1) (this PR) is the only reliable backstop. A complementary in-process mitigation for the one seeder is in #4343.

The seeder's own timeouts can't bound it

seed-climate-ocean-ice caps every fetch with AbortSignal.timeout (20s, 90s for NSIDC) and runSeed wraps the fetcher in withRetry (≤4 attempts). If those caps worked, the worst case would be ~6min (4 × 90s + backoff). It hit ~60min — 10× — because AbortSignal only cancels the connect/read phase, not a hung DNS lookup. Node's fetch resolves hosts via getaddrinfo, which runs on libuv's threadpool and is uncancellable.

Evidence (reproductions on the affected host)

AbortSignal bounds a socket connect-hang, but is ignored by a DNS stall:

# fetch() to a blackhole IP (no DNS), AbortSignal.timeout(3000):
[connect-hang IP 10.255.255.1] 3004ms -> TimeoutError: aborted due to timeout   # OK

# DNS resolve to a dead resolver, AbortSignal.timeout(3000):
[abort-only]  22086ms  -> ETIMEOUT     # AbortSignal IGNORED — settles on the resolver's own timeout
[no-signal]   22071ms  -> ETIMEOUT     # one dead-resolver lookup ≈ 22s

withRetry amplification is itself bounded when abort works (4 attempts × 3s timeout + 1/2/4s backoff ≈ 19s), confirming the extra 50+min comes purely from the uncancellable DNS phase, re-triggered across 4 source hosts and 4 retries:

[withRetry(3) x allSettled(4 blackhole) @3s each] 19048ms -> All upstreams failed
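The bound quoted above is plain arithmetic: attempts times the per-attempt timeout, plus the backoff delays. A quick sketch, assuming the same 1/2/4s backoff applies to the 90s NSIDC cap (the helper name worst_case is hypothetical):

```shell
#!/bin/sh
# Worst-case wall-clock for a retry loop whose per-attempt timeout works.
# worst_case ATTEMPTS TIMEOUT_S BACKOFF_S...: attempts * timeout + sum of backoffs.
worst_case() {
  attempts=$1
  timeout_s=$2
  shift 2
  total=$((attempts * timeout_s))
  for b in "$@"; do
    total=$((total + b))
  done
  echo "$total"
}

worst_case 4 3 1 2 4    # reproduction: 3s abort, 1/2/4s backoff -> 19
worst_case 4 90 1 2 4   # 90s NSIDC cap, same backoff assumed -> 367
```

367s is roughly 6min, so a run of ~60min cannot be explained by retries alone when the per-attempt timeout is honoured.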

(Host resolver here is the systemd-resolved stub at 127.0.0.53; when it flaps, every getaddrinfo for the NOAA/NSIDC/NASA hosts stalls and the 4-thread libuv pool serializes the overflow.)

Why this PR is the correct layer

Because the stall is in getaddrinfo — uncancellable from JS — AbortSignal, Promise.race, and withRetry tuning can all be defeated by it. Killing the process from outside (this PR's timeout -k) is the only guaranteed bound, which is exactly why a single hung seeder was able to consume the whole 2h window and starve the alphabetical tail. #4343 additionally races each fetch against a wall-clock deadline inside the worst offender so it self-recovers before the external cap is needed, but the external cap remains the catch-all for any seeder with the same blind spot.

The SEED_TIMEOUT=1800s (30min) default is below the legitimate runtime of bundle seeders: run-seeders.sh's seed-*.mjs glob includes seed-bundle-*.mjs, and _bundle-runner.mjs runs sections sequentially. resilience-recovery's Import-HHI section alone budgets 30min (timeoutMs 1_800_000), and the bundle's sections can sum well past that. Wrapping a bundle in the outer cap would SIGKILL it mid-run, drop its remaining sections, and orphan the in-flight section child, re-creating the exact tail-starvation this PR set out to prevent.

Bundles don't need the outer cap: _bundle-runner.mjs already hard-caps every section with its own wall-clock timer (SIGTERM→SIGKILL on the child PID, immune to the DNS-hang blind spot). Exempt seed-bundle-*.mjs from the wrapper; standalone seeders (the actual target) keep it.

Also resolve the timeout-usable check once (was re-evaluated per seeder) and document SEED_TIMEOUT in SELF_HOSTING.md.

Claude-Session: https://claude.ai/code/session_01DhVRmt455xAAoHWifVurgs
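The bundle exemption and the resolve-once check described in this commit message might look something like the sketch below. The names use_timeout and should_wrap, and the file seed-bundle-example.mjs, are hypothetical; only the seed-bundle-*.mjs pattern, SEED_TIMEOUT, and seed-trade-flows.mjs come from the thread.

```shell
#!/bin/sh
SEED_TIMEOUT="${SEED_TIMEOUT:-1800}"

# Resolve once whether the wrapper is usable, instead of per seeder.
use_timeout=0
if [ "$SEED_TIMEOUT" -gt 0 ] 2>/dev/null && command -v timeout >/dev/null 2>&1; then
  use_timeout=1
fi

# should_wrap FILE: succeed when FILE should run under the outer cap.
should_wrap() {
  [ "$use_timeout" -eq 1 ] || return 1
  case "$(basename "$1")" in
    seed-bundle-*.mjs) return 1 ;;  # bundles cap their own sections
    *) return 0 ;;
  esac
}

# seed-bundle-example.mjs is a made-up name used only for illustration.
for f in seed-bundle-example.mjs seed-trade-flows.mjs; do
  if should_wrap "$f"; then echo "$f: wrapped"; else echo "$f: unwrapped"; fi
done
```

A case pattern keeps the exemption POSIX-compatible and cheap, and computing use_timeout up front avoids repeating the PATH lookup for every seeder.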

github-actions Bot added the trust:safe label (Brin: contributor trust score safe) Jun 18, 2026

greptile-apps Bot reviewed Jun 18, 2026


guhyun9454 mentioned this pull request Jun 18, 2026

fix(seed-climate-ocean-ice): bound fetches with a wall-clock deadline so DNS hangs can't stall the run #4343

Merged

Merge branch 'main' into fix/seeder-per-seed-timeout

4794ad4

koala73 added the High Value (Meaningful contribution to the project) and Ready to Merge (PR is mergeable, passes checks, and adds value) labels Jun 23, 2026

vercel Bot deployed to Preview June 23, 2026 06:16 View deployment

koala73 added 2 commits June 23, 2026 10:40

fix(seed): classify timeouts only when timeout wrapper runs

88d2ec1

koala73 merged commit 128340e into koala73:main Jun 23, 2026
22 checks passed

koala73 merged 4 commits into koala73:main from guhyun9454:fix/seeder-per-seed-timeout
