fix(seeders): cap per-seeder wall-clock time in run-seeders.sh#4342
Conversation
The host-side seeders run sequentially, so a single upstream that hangs (observed: a slow NOAA/NSIDC fetch in seed-climate-ocean-ice that keeps the node process alive ~60min vs its normal ~4s) burns the rest of the window and starves every seeder after it. Under a wrapping systemd/cron job timeout the whole run is killed mid-alphabet, dropping every later seeder until the next trigger. Wrap each seeder in `timeout` (default 1800s, overridable via SEED_TIMEOUT, 0 to disable). The default sits well above the slowest legitimate seeder observed (~18min) and far below the pathological hangs (60min+), so it only kills runaway processes. timeouts are reported as a distinct TIMEOUT state in the per-seeder log and the final summary. Falls back to no wrapping when `timeout` is unavailable.
|
@guhyun9454 is attempting to deploy a commit to the World Monitor Team on Vercel. A member of the Team first needs to authorize it. |
| if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then | ||
| printf "TIMEOUT (killed after %ss)\n" "$SEED_TIMEOUT" | ||
| timedout=$((timedout + 1)) |
There was a problem hiding this comment.
rc=137 matches non-timeout SIGKILL sources too
Exit code 137 (128+9) is the generic shell representation of any SIGKILL, not only the one sent by timeout -k. The OOM killer, a container runtime enforcing a memory limit, or a manual kill -9 would also produce rc=137. The critical case is SEED_TIMEOUT=0: when timeouts are explicitly disabled, run_seed takes the else branch and invokes node directly, but the outer rc check still fires unconditionally — so an OOM-killed seeder with timeouts off would print TIMEOUT (killed after 0s) and increment timedout instead of being counted as a failure. Gate the rc=137 branch on whether the timeout wrapper was actually active.
| if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then | |
| printf "TIMEOUT (killed after %ss)\n" "$SEED_TIMEOUT" | |
| timedout=$((timedout + 1)) | |
| if [ "$rc" -eq 124 ] || { [ "$SEED_TIMEOUT" -gt 0 ] 2>/dev/null && [ "$rc" -eq 137 ]; }; then |
Root-cause investigation — why an external process timeout is the right layerI traced why The seeder's own timeouts can't bound it
Evidence (reproductions on the affected host)
(Host resolver here is the Why this PR is the correct layerBecause the stall is in |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
The SEED_TIMEOUT=1800s (30min) default is below the legitimate runtime of bundle seeders: run-seeders.sh's seed-*.mjs glob includes seed-bundle-*.mjs, and _bundle-runner.mjs runs sections sequentially — resilience-recovery's Import-HHI section alone budgets 30min (timeoutMs 1_800_000), and the bundle's sections can sum well past that. Wrapping a bundle in the outer cap would SIGKILL it mid-run, drop its remaining sections, and orphan the in-flight section child — re-creating the exact tail-starvation this PR set out to prevent. Bundles don't need the outer cap: _bundle-runner.mjs already hard-caps every section with its own wall-clock timer (SIGTERM→SIGKILL on the child PID, immune to the DNS-hang blind spot). Exempt seed-bundle-*.mjs from the wrapper; standalone seeders (the actual target) keep it. Also resolve the timeout-usable check once (was re-evaluated per seeder) and document SEED_TIMEOUT in SELF_HOSTING.md. Claude-Session: https://claude.ai/code/session_01DhVRmt455xAAoHWifVurgs
Problem
scripts/run-seeders.shruns all ~149 seeders sequentially with no per-seeder time bound. The only ceiling is whatever wraps the script (a systemd timer / cron job). When a single upstream hangs, that one seeder consumes the rest of the window and every seeder alphabetically after it is silently dropped until the next trigger.This is not hypothetical. On a self-hosted deployment running the script every 3h under a 2h systemd
TimeoutStartSec, two consecutive cycles were killed mid-run atseed-trade-flows.mjs, so the entiret–ztail (ucdp-events,unrest-events,usa-spending,wb-indicators,weather-alerts,webcams,yield-curve-eu, …) went stale for hours.Root cause measured
seed-climate-ocean-ice.mjsnormally completes in ~4s:…but intermittently balloons to ~60–68min (measured across multiple cycles) when its NOAA/NSIDC upstreams are slow — the node process stays alive far past the per-fetch
AbortSignalbudget. That single seeder is the dominant variance and the direct cause of the blown job timeout. For contrast, the slowest legitimate seeder observed (seed-gas-storage-countries.mjs) tops out around 18min of real work.Fix
Wrap each seeder invocation in
timeout(1):SEED_TIMEOUT=1800(30min) — comfortably above the slowest legitimate seeder (~18min), well below the pathological hangs (60min+), so it only kills runaway processes.SEED_TIMEOUT=<seconds>;SEED_TIMEOUT=0disables it entirely.-k 30sendsSIGKILL30s afterSIGTERMfor seeders that ignore the term signal.TIMEOUTstate in the per-seeder line and the final summary (Done: X ok, Y skipped, Z failed, W timed out), so they're not silently miscounted as failures.timeoutis not onPATH.This bounds the blast radius of any one slow/hung upstream so the rest of the run — especially the alphabetical tail — still gets its turn within the wrapping job's budget.
Testing
sh -n scripts/run-seeders.sh— passes (script is#!/bin/sh, POSIX).timeoutexit-code handling against fixture processes:OKTIMEOUTSIGKILLed after the-kgrace, exit 137 →TIMEOUTSEED_TIMEOUT=0→ no wrapping, runs to completionseed-climate-ocean-ice.mjs,seed-ucdp-events.mjs).Shell-only change to a self-hosting helper script; no TypeScript / build surface touched.