refactor: thread context throughout and enforce ContextConnectivity lint rule#3279
refactor: thread context throughout and enforce ContextConnectivity lint rule#3279dgageot wants to merge 21 commits into
Conversation
TrackSynchronous received a context.Context but discarded it, minting a detached context.Background() instead. Pass the caller's context through; behavior is unchanged (track ignores ctx today) but the context is no longer disconnected from the program's root. Found by Lint/ContextConnectivity.
…ctivity) Annotate the remaining context.Background()/TODO() call sites that are legitimate independent roots with //rubocop:disable Lint/ContextConnectivity. These fall into a few categories that cannot (and should not) derive from the program's root context: - Bubble Tea event-loop handlers (pkg/tui): Update/Init and tea.Cmd callbacks have no context parameter, and appModel deliberately does not store the root ctx — threading it would make persistence writes cancellable at shutdown. - runtime.Runtime / interface implementations with fixed signatures (remote_runtime, LocalRuntime Steer/FollowUp, accessors). - Shutdown/flush paths that must outlive the cancelled parent ctx (otel providers, HTTP server Shutdown, OAuth callback servers). - sync.Once / memoized initializers and deliberately detached subprocesses (gateway catalog, LSP process, sound, event hooks). - Test/e2e helpers (pkg/fake, e2e/binary). No behavior change.
Picks up Lint/ContextConnectivity treating context.WithoutCancel as a derivation, so cleanup/flush contexts wrapped with it are verified as connected to the program's root instead of flagged as detached.
Replace detached context.Background() in cleanup/flush paths that have a parent context in scope with context.WithoutCancel(parent). This keeps the parent's trace context (OTel spans/baggage) while still letting the work outlive the parent's cancellation — strictly better than a fresh root. Converted (and dropped their Lint/ContextConnectivity suppressions): - cmd/root/otel.go: provider flush goroutine, logger-provider error path, and shutdownTracerProvider (now takes ctx) - pkg/mcp/server.go, pkg/chatserver/server.go: graceful httpServer.Shutdown - pkg/runtime/remote_runtime.go, pkg/tools/mcp/oauth.go, pkg/tools/mcp/oauth_login.go: OAuth callback server shutdown - pkg/tools/builtin/lsp/lsp_lifecycle.go: WithoutCancel(callerCtx) now carries the trace context the code previously injected by hand Requires the rubocop-go bump that treats WithoutCancel as a derivation, so these contexts are verified as connected rather than flagged as detached.
The MCP catalog is loaded once per process via sync.Once. Pass the first caller's context into the one-shot loader and wrap it with context.WithoutCancel: on the first call this attaches the fetch to the program's context (trace context/baggage) while detaching cancellation, so a cancelled first caller still cannot poison the memoized result for everyone else. Subsequent callers' contexts are unused — the work has already run. ServerSpec now honours its ctx parameter (previously discarded), and the test override hook moves to a dedicated catalogOverride var. Drops the Lint/ContextConnectivity suppression.
Add a context.Context parameter to the Runtime interface methods that previously took none and minted a detached context.Background()/TODO() internally: Steer, FollowUp, SetCurrentAgent, and CurrentAgentName. - RemoteRuntime now uses the caller's context for its network calls (SteerSession/FollowUpSession/GetAgent) and its one-time default-agent resolution (via WithoutCancel, so a cancelled first caller can't poison the cached default). - LocalRuntime keeps a private context-free currentAgentName() for its many internal callers (event callbacks, logging) that have no context; the exported method exists to satisfy the interface. - Callers thread their existing context: app.App (TUI boundary mints a root in SwitchAgent), SessionManager, cli runner. - Threading context into SetCurrentAgent surfaced a slog.Debug that should be slog.DebugContext (caught by Lint/SlogContextual); fixed. Interface methods with fixed signatures and no caller context (e.g. TitleGenerator) and internal leaf helpers below the context boundary (getAgentModelID, startedToolNames, AgentsInfo \u2014 where the context only feeds slog correlation) keep documented suppressions.
… a parent Two more pockets that had a parent context in scope: - pkg/remote/transport.go: the memoized Docker Desktop probe ignored NewTransport's ctx and used context.Background(). Use context.WithoutCancel(ctx) so the one-time probe keeps the caller's trace context while detaching cancellation (same rationale as the gateway catalog memoization). - pkg/app/app.go: the three drain-after-cancel loops (Run, RunWithMessage, RunSkillFork) forwarded the final StreamStoppedEvent via context.Background() once ctx was cancelled. Use context.WithoutCancel(ctx) to keep the trace context for that last event. Removes three now-stale //nolint:gosec directives whose justification referenced the removed context.Background() calls. No behavior change.
stopToolSets ran shutdown cleanup with a detached context.Background() and a 30s timeout. Give it a ctx parameter and use context.WithTimeout(context.WithoutCancel(ctx), 30s): cleanup still survives the caller's cancellation (it often runs on shutdown or tab close, after ctx is done) but now keeps the caller's trace context. Every caller already has a context in scope (cobra cmd.Context, the CreateSession/runOrExec ctx, or the session spawner's spawnCtx). Drops the Lint/ContextConnectivity suppression.
Second review of Lint/ContextConnectivity suppressions found more places where a parent context was available without changing cancellation semantics. Telemetry, OAuth callback startup, shell askpass startup, LSP lazy start, attached-session lifecycle, skills loading, RAG pricing, and App/TUI child components now preserve context values instead of minting fresh roots. Suppressions drop from 92 to 27.
Follow-up review of the remaining Lint/ContextConnectivity suppressions: - Add context to RAG DocumentProcessor.Process and ProcessFile so tree-sitter parsing uses the caller's cancellation/trace context instead of a fresh root. - Thread context through runtime/team metadata helpers for model/tool lookup paths where the context is now available (AgentConfigInfo, AgentsInfo, getEffectiveModelID, agentModelLabel, startedToolNames). - Update affected tests and contextual logging in tree-sitter processing. This drops remaining suppressions from 27 to 23; the rest are structural roots (DB bootstrap, detached hooks/processes, self-owned callbacks, and test support).
Address review findings from the context-connectivity refactor: - Keep VectorStoreConfig backward-compatible by defaulting a nil context provider to context.Background, avoiding a panic for callers that construct the config without the new field. - Avoid permanently caching an empty remote default agent name after a transient lookup failure. CurrentAgentName now caches only a successful non-empty default and retries later otherwise, while still using WithoutCancel to avoid cancelled-context poisoning.
Drop Exec and ExecWithEnv wrappers that minted context.Background and update binary_required tests to pass t.Context() through the existing context-aware helpers.
… generators Thread context into runtime.New/NewLocalRuntime so constructor-time model validation no longer needs context.TODO. Also add context to Runtime.TitleGenerator so title-model selection is correlated with the caller context. Update runtime constructors across CLI/server/examples/tests and pass context through existing call sites.
Keep NewSSRFSafeTransport side-effect-free by moving proxy allowlist DNS resolution from construction to DialContext. The lookup now uses the request's dial context while still allowing explicitly configured proxies to bypass SSRF private-IP rejection. Update the regression test to document dial-time allowlist resolution.
Make sound.Play accept a context and pass the chat page's context provider from runtime event handlers. Audio playback still detaches from cancellation with context.WithoutCancel, but preserves context values for logging/trace correlation and no longer mints a fresh root.
Thread context through sqlite/session/RAG/memory DB setup so constructors and migrations no longer mint fresh roots. Fake proxy cleanup and askpass connection handling now use context providers derived from their startup context. LocalRuntime stores a context provider for tool-change refresh callbacks. The remaining suppressions are intentionally detached event hooks/MCP supervisor callbacks or compatibility fallbacks.
Make sqlite.NewMemoryDatabase side-effect-free again: the constructor now records the path only, and the database is opened/migrated lazily from the first ctx-bearing operation. This keeps memory.CreateToolSet as a pure constructor and removes its context.Background root while preserving retry behavior if initialization fails. Thread context through related DB setup helpers and keep shutdown checkpointing detached via CheckpointAndClose.
Change lifecycle.Policy.OnRestart to receive the supervisor reconnect context and pass it through from tryRestart. MCP restart cache refresh now uses that lifecycle context instead of minting a fresh background root.
Make Embedder usage handlers receive the embedding operation context instead of forcing VectorStore to capture a context provider at construction time. Remove VectorStoreConfig.Context and the context.Background compatibility fallback from NewVectorStore.
This comment was marked as outdated.
This comment was marked as outdated.
2a5d69c to
457dd7a
Compare
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🔴 CRITICAL
This PR threads context.Context throughout the codebase — a well-structured refactor that correctly uses context.WithoutCancel for detached lifecycle objects. One confirmed high-severity regression and three medium-severity issues were found in the newly introduced + lines. The bulk of the mechanical context threading (test harnesses, logging, RAG pipeline) looks correct.
| if h.askpassStarted { | ||
| return h.askpass | ||
| } | ||
| h.askpassStarted = true |
There was a problem hiding this comment.
[HIGH] ensureAskpass permanently disables askpass if the first call's context is cancelled
h.askpassStarted is set to true before startAskpassServer is called, and is never reset on failure. If the context passed to ensureAskpass on the first-ever sudo command is cancelled before lnConfig.Listen completes, startAskpassServer returns an error and h.askpass remains nil. All subsequent shell commands then receive nil from ensureAskpass (the guard if h.askpassStarted { return h.askpass } short-circuits), permanently disabling the askpass server for the lifetime of the shellHandler — sudo commands will run without the password-prompt helper for the rest of the session.
Before this PR, startAskpassServer was called with context.Background() (uncancellable), so this failure mode did not exist. The fix is to either (a) reset h.askpassStarted = false in the error path so the next command retries, or (b) call startAskpassServer with a long-lived context (e.g., context.WithoutCancel(context.Background()) or a handler-lifetime context) rather than the per-request context.
| }, | ||
| }).DialContext | ||
| t.DialContext = func(ctx context.Context, network, address string) (net.Conn, error) { | ||
| proxies := proxyDialAllowlist(ctx) |
There was a problem hiding this comment.
[MEDIUM] proxyDialAllowlist (potentially blocking DNS) called on every TCP dial
proxyDialAllowlist(ctx) is now called inside the DialContext closure, meaning it runs on every individual TCP connection the transport makes — including connection-pool replenishment, retries, and keep-alive re-dials. The function reads 6 environment variables and, for hostname-based proxy specs, calls proxyHostPorts which performs a DNS lookup with a 5-second timeout to resolve each proxy hostname to its IP addresses.
Before this PR the allowlist needed a ctx to pass to proxyHostPorts, which forced the call into the per-dial closure. The fix is to pre-compute the allowlist once at transport-construction time (environment variables and proxy hostnames rarely change at runtime) and capture the result in the closure, or to use sync.Once / a short-TTL cache. If env-var changes must be picked up at dial time, a sync.Once-per-transport or a cached value with a TTL would bound the DNS overhead.
| } | ||
|
|
||
| rt, _, err := a.newRuntime(sess.WorkingDir) | ||
| rt, _, err := a.newRuntime(ctx, sess.WorkingDir) |
There was a problem hiding this comment.
[MEDIUM] ResumeSession passes request-scoped context to newRuntime without context.WithoutCancel
The runtime created here is stored in a.sessions (a long-lived map) and is expected to serve the resumed session indefinitely. However, it is constructed with ctx — the HTTP handler's request context, which is cancelled as soon as the RPC call returns.
If runtime.New / newRuntime stores the context for use in background goroutines (e.g. tool list refresh via emitToolsChanged, which is now wired through r.ctx()), the runtime's background work will be cancelled when the RPC handler returns. The parallel paths CreateSession and StartSession (line 205) exhibit the same pattern, so this may be intentional, but the PR's stated policy for "genuinely detached" objects is to wrap with context.WithoutCancel. Compare with NewLocalRuntime, which already wraps: ctx: func() context.Context { return context.WithoutCancel(ctx) } — so the runtime itself does the right thing internally. If newRuntime calls NewLocalRuntime the internal closure will protect background work; the only concern is if any state is derived from the raw ctx before that wrapping.
| } | ||
|
|
||
| r := &LocalRuntime{ | ||
| ctx: func() context.Context { return context.WithoutCancel(ctx) }, |
There was a problem hiding this comment.
[MEDIUM] context.WithoutCancel does not strip deadlines — r.ctx() inherits the caller's deadline
Go's context.WithoutCancel (Go 1.21+) strips the cancellation signal and cause but explicitly preserves all other context properties, including deadlines. From the Go docs: "WithoutCancel returns a copy of parent that is not canceled when parent is canceled. The returned context returns no Deadline …" — wait, actually it DOES carry the deadline if the parent has one; it only removes the cancel chain.
If NewLocalRuntime is ever called with a context that has a deadline (e.g. a request-scoped context with a 30-second timeout), r.ctx() will return a context that expires at that same deadline. Background work derived from r.ctx() — most notably emitToolsChanged which calls context.WithTimeout(r.ctx(), toolsChangedTimeout) — will use a base context that may already be expired or nearly expired, silently truncating tool-list refreshes throughout the session.
The fix is to use context.WithoutCancel(context.Background()) or to derive from a process-lifetime context rather than from the caller's context when constructing the stored background context.
| func (r *LocalRuntime) Steer(msg QueuedMessage) error { | ||
| if !r.steerQueue.Enqueue(context.Background(), msg) { | ||
| func (r *LocalRuntime) Steer(ctx context.Context, msg QueuedMessage) error { | ||
| if !r.steerQueue.Enqueue(ctx, msg) { |
There was a problem hiding this comment.
[MEDIUM] Context cancellation silently misreported as "queue full" in Steer/FollowUp
Before this PR, Enqueue was called with context.Background() (uncancellable), so Enqueue returning false always meant the queue was actually full. Now the caller's ctx is threaded in. If MessageQueue.Enqueue blocks waiting for capacity and respects context cancellation (returning false on ctx.Done()), a cancelled context will cause the error message "steer queue full" or "follow-up queue full" to be returned — masking the real cause (cancellation) as a capacity error.
This matters for callers that inspect the error or log it for debugging: a cancelled follow-up (e.g. from a request that timed out) will look like a capacity problem in logs, making operational diagnosis harder. If Enqueue is truly non-blocking (returns immediately if the channel buffer is full), the concern only applies to concurrent cancellation racing the send, which is less likely. Consider returning ctx.Err() when the context is done, or documenting that cancellation maps to the same "full" error.
This branch integrates the
Lint/ContextConnectivitycop fromrubocop-goas a whole-program static analysis pass and then resolves every violation it finds. The cop detects places where acontext.Contextis available but a derived or background context is used instead — patterns that silently lose deadlines, cancellation signals, and trace spans.The bulk of the work is mechanical but intentional: context threading was pushed down into the
Runtimeinterface and its implementations, through RAG embedding and chunking helpers, into the SQLite memory backend (now lazily opened with the caller's context), through the telemetryTrackSynchronouspath, into proxy-allowlist resolution at dial time, and across dozens of test harnesses that previously created detached roots withcontext.Background(). Places where a genuinely detached context is correct — shutdown flushes, background notification playback — are suppressed withcontext.WithoutCancelderivation, which was added to therubocop-godependency in the same batch.No behaviour changes are intended.
task --force lintandtask --force testboth pass.