fix(cloud-agent): recover sessions after runtime starvation #3908
eshurakov wants to merge 1 commit into
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Executive Summary
This PR introduces well-structured runtime starvation recovery across four coordinated layers (Bash event coalescing, nice-10 Kilo launch shim,
Files Reviewed (17 files)
Notable Design Observations
Coalescer memory/timer safety:
Shell injection safety: All dynamic values in
Fail-closed on repair scheduling:
Warm reuse guard: The updated
AWK field extraction: The
Reviewed by claude-4.6-sonnet-20260217 · 2,201,760 tokens
Review guidance: REVIEW.md from base branch
Summary
Why
Long package installs, formatting, linting, and test commands can starve Kilo inside the shared sandbox while producing thousands of cumulative Bash updates. Observed failures left orphaned Kilo runtimes behind and could exhaust a follow-up message's delivery retries without ever terminalizing it. Recovery therefore has to coordinate durable message state with the physical wrapper/Kilo lifecycle without replaying a turn that may already have changed the workspace.
What was done
/proc, and revalidate exact ownership before signalling direct or devcontainer PIDs.
terminalization-pending work so alarm replay can finish failure settlement without redispatching the turn; retain guarded legacy ownership matching for rolling deployments.
High-level architecture
sequenceDiagram
    participant Caller
    participant DO as Cloud Agent Durable Object
    participant Sandbox
    participant Wrapper
    participant Kilo
    Caller->>DO: Admit durable message
    DO->>Sandbox: Prepare or reuse leased runtime
    Sandbox->>Wrapper: Start with instance and generation
    Wrapper->>Kilo: Launch at nice 10 with inherited owner
    Kilo-->>Wrapper: Running Bash snapshots
    Wrapper-->>DO: Leading, latest, and terminal events
    opt Delivery exhausts
        DO->>DO: Schedule repair alarm
        DO->>DO: Persist terminalization-pending
        DO->>DO: Terminalize failed without redispatch
    end
    opt Runtime cleanup is required
        DO->>Sandbox: Inspect wrapper and owned descendants
        Sandbox->>Sandbox: Revalidate owner and signal exact PIDs
        Sandbox-->>DO: Report absent after re-observation
    end
Architecture decision
Decision: Recover each failure at its owning lifecycle boundary: the Durable Object guarantees settlement repair, the wrapper limits event amplification, and the sandbox lifecycle owns physical descendant cleanup through the existing lease identity.
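The settlement-repair ordering can be illustrated with a minimal sketch. It assumes a simplified in-memory record in place of Durable Object storage; `Store`, `onDeliveryExhausted`, `onAlarm`, the `queued` state, and the one-second repair delay are hypothetical names and values chosen for illustration, not the PR's actual code.

```typescript
// Hypothetical message states; only "terminalization-pending" is named in the PR.
type MessageState = "queued" | "terminalization-pending" | "failed";

// Stand-in for durable storage plus the Durable Object alarm slot (assumption).
interface Store {
  state: MessageState;
  alarmAt: number | null;
}

// Called once delivery retries are exhausted.
function onDeliveryExhausted(store: Store, now: number): void {
  // 1. Schedule the repair alarm first, so a crash after this point
  //    still leaves something that will finish the settlement.
  store.alarmAt = now + 1000;
  // 2. Persist the intent to terminalize, distinct from ordinary queued work.
  store.state = "terminalization-pending";
}

// Alarm replay: finishes failure settlement and never redispatches pending work.
function onAlarm(store: Store): "settled" | "redispatch" | "noop" {
  store.alarmAt = null;
  if (store.state === "terminalization-pending") {
    store.state = "failed";
    return "settled";
  }
  // Only work that was never marked for terminalization may be dispatched.
  return store.state === "queued" ? "redispatch" : "noop";
}
```

Because the pending marker is checked before any dispatch decision, replaying the alarm any number of times settles the message at most once and cannot rerun a turn that may already have changed the workspace.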
Context: Durable message state outlives wrapper processes, Kilo runs as a separate process whose commands can survive wrapper exit, and cumulative Bash updates create pressure before events reach the Durable Object.
Rationale: Scheduling repair before terminal disposition removes the stranded-state window; inherited lease ownership remains observable after wrapper death; and coalescing before WebSocket transmission reduces load at the earliest controllable boundary.
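A minimal sketch of the leading, latest, and terminal coalescing described above, assuming each Bash update carries a call identifier and a cumulative output snapshot. `BashEvent`, `BashCoalescer`, and the explicit `flush()` method are hypothetical; a real wrapper would presumably drive the flush from a timer before WebSocket transmission.

```typescript
// Hypothetical event shape; the PR's real event types are not shown.
type BashEvent = { callId: string; output: string; terminal: boolean };

class BashCoalescer {
  private readonly send: (event: BashEvent) => void;
  private readonly pending = new Map<string, BashEvent>();
  private readonly started = new Set<string>();

  constructor(send: (event: BashEvent) => void) {
    this.send = send;
  }

  push(event: BashEvent): void {
    if (event.terminal) {
      // Terminal events supersede any buffered snapshot and go out at once.
      this.pending.delete(event.callId);
      this.started.delete(event.callId);
      this.send(event);
      return;
    }
    if (!this.started.has(event.callId)) {
      // Leading event: forwarded immediately so progress is visible.
      this.started.add(event.callId);
      this.send(event);
      return;
    }
    // Snapshots are cumulative, so only the newest one per call is kept.
    this.pending.set(event.callId, event);
  }

  // Emits the latest buffered snapshot for each call, then clears the buffer.
  flush(): void {
    for (const event of this.pending.values()) this.send(event);
    this.pending.clear();
  }
}
```

Intermediate snapshots are dropped on purpose: because each update is cumulative, the latest snapshot already contains everything the discarded ones did.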
Alternatives considered:
.kilo serve processes and command descendants can outlive the wrapper and poison subsequent recovery.
Consequences: Exhausted work can settle durably without redispatch, and cleanup no longer reports absence while an observed owned runtime remains. The trade-off is Linux /proc and shell-tool dependence, intentional loss of intermediate cumulative snapshots, and advisory CPU prioritization rather than hard memory or CPU isolation.
Verification
kilo-real and .kilo serve ran at nice 10 with the same runtime-owner marker.
Visual Changes
N/A
Reviewer Notes
/proc ownership parsing and the guarded pre-KILO_RUNTIME_OWNER fallback; inspection failures intentionally block reuse and cleanup completion.
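For reviewers who want a mental model of the fail-closed ownership check, here is a minimal sketch. It assumes the owner marker is an environment variable readable from the NUL-separated contents of /proc/<pid>/environ and that the expected value is an opaque string. The PR itself relies on shell tools, so `parseEnviron`, `classifyOwner`, `maySignal`, `mayReportAbsent`, and the marker format below are hypothetical illustrations rather than its implementation.

```typescript
// Parses NUL-separated KEY=VALUE entries, the format of /proc/<pid>/environ.
function parseEnviron(raw: string): Map<string, string> {
  const env = new Map<string, string>();
  for (const entry of raw.split("\0")) {
    const eq = entry.indexOf("=");
    if (eq <= 0) continue; // skip empty or malformed entries
    env.set(entry.slice(0, eq), entry.slice(eq + 1));
  }
  return env;
}

type Inspection = "owned" | "foreign" | "unknown";

// raw is null when the process could not be inspected (assumption).
function classifyOwner(raw: string | null, expectedOwner: string): Inspection {
  if (raw === null) return "unknown"; // inspection failure: fail closed
  const owner = parseEnviron(raw).get("KILO_RUNTIME_OWNER");
  return owner === expectedOwner ? "owned" : "foreign";
}

// Only an exact ownership match may be signalled.
function maySignal(inspection: Inspection): boolean {
  return inspection === "owned";
}

// Cleanup may report absence only if nothing owned or uninspectable remains.
function mayReportAbsent(inspections: Inspection[]): boolean {
  return inspections.every((i) => i === "foreign");
}
```

Treating an unreadable process as `unknown` rather than `foreign` is what makes the check fail closed: it can neither be signalled nor counted as absent.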