Skip to content

refactor: thread context throughout and enforce ContextConnectivity lint rule#3279

Open
dgageot wants to merge 21 commits into
docker:mainfrom
dgageot:worktree-board-7d06d68b16fd88dd
Open

refactor: thread context throughout and enforce ContextConnectivity lint rule#3279
dgageot wants to merge 21 commits into
docker:mainfrom
dgageot:worktree-board-7d06d68b16fd88dd

Conversation

@dgageot

@dgageot dgageot commented Jun 26, 2026

Copy link
Copy Markdown
Member

This branch integrates the Lint/ContextConnectivity cop from rubocop-go as a whole-program static analysis pass and then resolves every violation it finds. The cop detects places where a context.Context is available but a derived or background context is used instead — patterns that silently lose deadlines, cancellation signals, and trace spans.

The bulk of the work is mechanical but intentional: context threading was pushed down into the Runtime interface and its implementations, through RAG embedding and chunking helpers, into the SQLite memory backend (now lazily opened with the caller's context), through the telemetry TrackSynchronous path, into proxy-allowlist resolution at dial time, and across dozens of test harnesses that previously created detached roots with context.Background(). Places where a genuinely detached context is correct — shutdown flushes, background notification playback — are suppressed with context.WithoutCancel derivation, which was added to the rubocop-go dependency in the same batch.

No behaviour changes are intended. task --force lint and task --force test both pass.

@dgageot dgageot requested a review from a team as a code owner June 26, 2026 17:15
dgageot added 20 commits June 26, 2026 19:16
TrackSynchronous received a context.Context but discarded it, minting a
detached context.Background() instead. Pass the caller's context through;
behavior is unchanged (track ignores ctx today) but the context is no
longer disconnected from the program's root.

Found by Lint/ContextConnectivity.
…ctivity)

Annotate the remaining context.Background()/TODO() call sites that are
legitimate independent roots with //rubocop:disable Lint/ContextConnectivity.
These fall into a few categories that cannot (and should not) derive from
the program's root context:

- Bubble Tea event-loop handlers (pkg/tui): Update/Init and tea.Cmd
  callbacks have no context parameter, and appModel deliberately does not
  store the root ctx — threading it would make persistence writes
  cancellable at shutdown.
- runtime.Runtime / interface implementations with fixed signatures
  (remote_runtime, LocalRuntime Steer/FollowUp, accessors).
- Shutdown/flush paths that must outlive the cancelled parent ctx
  (otel providers, HTTP server Shutdown, OAuth callback servers).
- sync.Once / memoized initializers and deliberately detached subprocesses
  (gateway catalog, LSP process, sound, event hooks).
- Test/e2e helpers (pkg/fake, e2e/binary).

No behavior change.
Picks up Lint/ContextConnectivity treating context.WithoutCancel as a
derivation, so cleanup/flush contexts wrapped with it are verified as
connected to the program's root instead of flagged as detached.
Replace detached context.Background() in cleanup/flush paths that have a
parent context in scope with context.WithoutCancel(parent). This keeps the
parent's trace context (OTel spans/baggage) while still letting the work
outlive the parent's cancellation — strictly better than a fresh root.

Converted (and dropped their Lint/ContextConnectivity suppressions):
- cmd/root/otel.go: provider flush goroutine, logger-provider error path,
  and shutdownTracerProvider (now takes ctx)
- pkg/mcp/server.go, pkg/chatserver/server.go: graceful httpServer.Shutdown
- pkg/runtime/remote_runtime.go, pkg/tools/mcp/oauth.go,
  pkg/tools/mcp/oauth_login.go: OAuth callback server shutdown
- pkg/tools/builtin/lsp/lsp_lifecycle.go: WithoutCancel(callerCtx) now
  carries the trace context the code previously injected by hand

Requires the rubocop-go bump that treats WithoutCancel as a derivation, so
these contexts are verified as connected rather than flagged as detached.
The MCP catalog is loaded once per process via sync.Once. Pass the first
caller's context into the one-shot loader and wrap it with
context.WithoutCancel: on the first call this attaches the fetch to the
program's context (trace context/baggage) while detaching cancellation, so
a cancelled first caller still cannot poison the memoized result for
everyone else. Subsequent callers' contexts are unused — the work has
already run.

ServerSpec now honours its ctx parameter (previously discarded), and the
test override hook moves to a dedicated catalogOverride var. Drops the
Lint/ContextConnectivity suppression.
Add a context.Context parameter to the Runtime interface methods that
previously took none and minted a detached context.Background()/TODO()
internally: Steer, FollowUp, SetCurrentAgent, and CurrentAgentName.

- RemoteRuntime now uses the caller's context for its network calls
  (SteerSession/FollowUpSession/GetAgent) and its one-time default-agent
  resolution (via WithoutCancel, so a cancelled first caller can't poison
  the cached default).
- LocalRuntime keeps a private context-free currentAgentName() for its many
  internal callers (event callbacks, logging) that have no context; the
  exported method exists to satisfy the interface.
- Callers thread their existing context: app.App (TUI boundary mints a root
  in SwitchAgent), SessionManager, cli runner.
- Threading context into SetCurrentAgent surfaced a slog.Debug that should
  be slog.DebugContext (caught by Lint/SlogContextual); fixed.

Interface methods with fixed signatures and no caller context (e.g.
TitleGenerator) and internal leaf helpers below the context boundary
(getAgentModelID, startedToolNames, AgentsInfo \u2014 where the context only
feeds slog correlation) keep documented suppressions.
… a parent

Two more pockets that had a parent context in scope:

- pkg/remote/transport.go: the memoized Docker Desktop probe ignored
  NewTransport's ctx and used context.Background(). Use
  context.WithoutCancel(ctx) so the one-time probe keeps the caller's trace
  context while detaching cancellation (same rationale as the gateway
  catalog memoization).
- pkg/app/app.go: the three drain-after-cancel loops (Run, RunWithMessage,
  RunSkillFork) forwarded the final StreamStoppedEvent via
  context.Background() once ctx was cancelled. Use context.WithoutCancel(ctx)
  to keep the trace context for that last event. Removes three now-stale
  //nolint:gosec directives whose justification referenced the removed
  context.Background() calls.

No behavior change.
stopToolSets ran shutdown cleanup with a detached context.Background() and a
30s timeout. Give it a ctx parameter and use
context.WithTimeout(context.WithoutCancel(ctx), 30s): cleanup still survives
the caller's cancellation (it often runs on shutdown or tab close, after ctx
is done) but now keeps the caller's trace context. Every caller already has a
context in scope (cobra cmd.Context, the CreateSession/runOrExec ctx, or the
session spawner's spawnCtx). Drops the Lint/ContextConnectivity suppression.
Second review of Lint/ContextConnectivity suppressions found more places
where a parent context was available without changing cancellation semantics.
Telemetry, OAuth callback startup, shell askpass startup, LSP lazy start,
attached-session lifecycle, skills loading, RAG pricing, and App/TUI child
components now preserve context values instead of minting fresh roots.
Suppressions drop from 92 to 27.
Follow-up review of the remaining Lint/ContextConnectivity suppressions:

- Add context to RAG DocumentProcessor.Process and ProcessFile so tree-sitter
  parsing uses the caller's cancellation/trace context instead of a fresh root.
- Thread context through runtime/team metadata helpers for model/tool lookup
  paths where the context is now available (AgentConfigInfo, AgentsInfo,
  getEffectiveModelID, agentModelLabel, startedToolNames).
- Update affected tests and contextual logging in tree-sitter processing.

This drops remaining suppressions from 27 to 23; the rest are structural roots
(DB bootstrap, detached hooks/processes, self-owned callbacks, and test support).
Address review findings from the context-connectivity refactor:

- Keep VectorStoreConfig backward-compatible by defaulting a nil context
  provider to context.Background, avoiding a panic for callers that construct
  the config without the new field.
- Avoid permanently caching an empty remote default agent name after a
  transient lookup failure. CurrentAgentName now caches only a successful
  non-empty default and retries later otherwise, while still using
  WithoutCancel to avoid cancelled-context poisoning.
Drop Exec and ExecWithEnv wrappers that minted context.Background and update binary_required tests to pass t.Context() through the existing context-aware helpers.
… generators

Thread context into runtime.New/NewLocalRuntime so constructor-time model validation no longer needs context.TODO. Also add context to Runtime.TitleGenerator so title-model selection is correlated with the caller context.

Update runtime constructors across CLI/server/examples/tests and pass context through existing call sites.
Keep NewSSRFSafeTransport side-effect-free by moving proxy allowlist DNS resolution from construction to DialContext. The lookup now uses the request's dial context while still allowing explicitly configured proxies to bypass SSRF private-IP rejection.

Update the regression test to document dial-time allowlist resolution.
Make sound.Play accept a context and pass the chat page's context provider from runtime event handlers. Audio playback still detaches from cancellation with context.WithoutCancel, but preserves context values for logging/trace correlation and no longer mints a fresh root.
Thread context through sqlite/session/RAG/memory DB setup so constructors and migrations no longer mint fresh roots. Fake proxy cleanup and askpass connection handling now use context providers derived from their startup context. LocalRuntime stores a context provider for tool-change refresh callbacks.

The remaining suppressions are intentionally detached event hooks/MCP supervisor callbacks or compatibility fallbacks.
Make sqlite.NewMemoryDatabase side-effect-free again: the constructor now records the path only, and the database is opened/migrated lazily from the first ctx-bearing operation. This keeps memory.CreateToolSet as a pure constructor and removes its context.Background root while preserving retry behavior if initialization fails.

Thread context through related DB setup helpers and keep shutdown checkpointing detached via CheckpointAndClose.
Change lifecycle.Policy.OnRestart to receive the supervisor reconnect context and pass it through from tryRestart. MCP restart cache refresh now uses that lifecycle context instead of minting a fresh background root.
Make Embedder usage handlers receive the embedding operation context instead of forcing VectorStore to capture a context provider at construction time. Remove VectorStoreConfig.Context and the context.Background compatibility fallback from NewVectorStore.
@aheritier aheritier marked this pull request as draft June 26, 2026 17:33
@aheritier aheritier added area/a2a Agent-to-Agent protocol, A2A server, inter-agent communication area/ci CI/CD workflows and pipeline area/cli CLI commands, flags, output formatting area/deps Dependency updates and version bumps area/gateway Gateway, proxy, and routing area/mcp MCP protocol, MCP tool servers, integration area/rag For work/issues that have to do with the RAG features area/runtime Runtime engine, agent loop execution, tool dispatch, loop detection labels Jun 26, 2026
@aheritier aheritier added area/testing Test infrastructure, CI/CD, test runners, evaluation kind/refactor PR refactors code without behavior change status/needs-rebase PR has merge conflicts or is out of date with main labels Jun 26, 2026
@aheritier

This comment was marked as outdated.

@dgageot dgageot force-pushed the worktree-board-7d06d68b16fd88dd branch from 2a5d69c to 457dd7a Compare June 26, 2026 17:34
@aheritier aheritier removed the status/needs-rebase PR has merge conflicts or is out of date with main label Jun 26, 2026
@dgageot dgageot marked this pull request as ready for review June 26, 2026 20:06

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🔴 CRITICAL

This PR threads context.Context throughout the codebase — a well-structured refactor that correctly uses context.WithoutCancel for detached lifecycle objects. One confirmed high-severity regression and three medium-severity issues were found in the newly introduced + lines. The bulk of the mechanical context threading (test harnesses, logging, RAG pipeline) looks correct.

if h.askpassStarted {
return h.askpass
}
h.askpassStarted = true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH] ensureAskpass permanently disables askpass if the first call's context is cancelled

h.askpassStarted is set to true before startAskpassServer is called, and is never reset on failure. If the context passed to ensureAskpass on the first-ever sudo command is cancelled before lnConfig.Listen completes, startAskpassServer returns an error and h.askpass remains nil. All subsequent shell commands then receive nil from ensureAskpass (the guard if h.askpassStarted { return h.askpass } short-circuits), permanently disabling the askpass server for the lifetime of the shellHandler — sudo commands will run without the password-prompt helper for the rest of the session.

Before this PR, startAskpassServer was called with context.Background() (uncancellable), so this failure mode did not exist. The fix is to either (a) reset h.askpassStarted = false in the error path so the next command retries, or (b) call startAskpassServer with a long-lived context (e.g., context.WithoutCancel(context.Background()) or a handler-lifetime context) rather than the per-request context.

Comment thread pkg/httpclient/ssrf.go
},
}).DialContext
t.DialContext = func(ctx context.Context, network, address string) (net.Conn, error) {
proxies := proxyDialAllowlist(ctx)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] proxyDialAllowlist (potentially blocking DNS) called on every TCP dial

proxyDialAllowlist(ctx) is now called inside the DialContext closure, meaning it runs on every individual TCP connection the transport makes — including connection-pool replenishment, retries, and keep-alive re-dials. The function reads 6 environment variables and, for hostname-based proxy specs, calls proxyHostPorts which performs a DNS lookup with a 5-second timeout to resolve each proxy hostname to its IP addresses.

Before this PR the allowlist needed a ctx to pass to proxyHostPorts, which forced the call into the per-dial closure. The fix is to pre-compute the allowlist once at transport-construction time (environment variables and proxy hostnames rarely change at runtime) and capture the result in the closure, or to use sync.Once / a short-TTL cache. If env-var changes must be picked up at dial time, a sync.Once-per-transport or a cached value with a TTL would bound the DNS overhead.

Comment thread pkg/acp/agent.go
}

rt, _, err := a.newRuntime(sess.WorkingDir)
rt, _, err := a.newRuntime(ctx, sess.WorkingDir)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] ResumeSession passes request-scoped context to newRuntime without context.WithoutCancel

The runtime created here is stored in a.sessions (a long-lived map) and is expected to serve the resumed session indefinitely. However, it is constructed with ctx — the HTTP handler's request context, which is cancelled as soon as the RPC call returns.

If runtime.New / newRuntime stores the context for use in background goroutines (e.g. tool list refresh via emitToolsChanged, which is now wired through r.ctx()), the runtime's background work will be cancelled when the RPC handler returns. The parallel paths CreateSession and StartSession (line 205) exhibit the same pattern, so this may be intentional, but the PR's stated policy for "genuinely detached" objects is to wrap with context.WithoutCancel. Compare with NewLocalRuntime, which already wraps: ctx: func() context.Context { return context.WithoutCancel(ctx) } — so the runtime itself does the right thing internally. If newRuntime calls NewLocalRuntime the internal closure will protect background work; the only concern is if any state is derived from the raw ctx before that wrapping.

Comment thread pkg/runtime/runtime.go
}

r := &LocalRuntime{
ctx: func() context.Context { return context.WithoutCancel(ctx) },

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] context.WithoutCancel does not strip deadlines — r.ctx() inherits the caller's deadline

Go's context.WithoutCancel (Go 1.21+) strips the cancellation signal and cause but explicitly preserves all other context properties, including deadlines. From the Go docs: "WithoutCancel returns a copy of parent that is not canceled when parent is canceled. The returned context returns no Deadline …" — wait, actually it DOES carry the deadline if the parent has one; it only removes the cancel chain.

If NewLocalRuntime is ever called with a context that has a deadline (e.g. a request-scoped context with a 30-second timeout), r.ctx() will return a context that expires at that same deadline. Background work derived from r.ctx() — most notably emitToolsChanged which calls context.WithTimeout(r.ctx(), toolsChangedTimeout) — will use a base context that may already be expired or nearly expired, silently truncating tool-list refreshes throughout the session.

The fix is to use context.WithoutCancel(context.Background()) or to derive from a process-lifetime context rather than from the caller's context when constructing the stored background context.

Comment thread pkg/runtime/runtime.go
func (r *LocalRuntime) Steer(msg QueuedMessage) error {
if !r.steerQueue.Enqueue(context.Background(), msg) {
func (r *LocalRuntime) Steer(ctx context.Context, msg QueuedMessage) error {
if !r.steerQueue.Enqueue(ctx, msg) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Context cancellation silently misreported as "queue full" in Steer/FollowUp

Before this PR, Enqueue was called with context.Background() (uncancellable), so Enqueue returning false always meant the queue was actually full. Now the caller's ctx is threaded in. If MessageQueue.Enqueue blocks waiting for capacity and respects context cancellation (returning false on ctx.Done()), a cancelled context will cause the error message "steer queue full" or "follow-up queue full" to be returned — masking the real cause (cancellation) as a capacity error.

This matters for callers that inspect the error or log it for debugging: a cancelled follow-up (e.g. from a request that timed out) will look like a capacity problem in logs, making operational diagnosis harder. If Enqueue is truly non-blocking (returns immediately if the channel buffer is full), the concern only applies to concurrent cancellation racing the send, which is less likely. Consider returning ctx.Err() when the context is done, or documenting that cancellation maps to the same "full" error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/a2a Agent-to-Agent protocol, A2A server, inter-agent communication area/ci CI/CD workflows and pipeline area/cli CLI commands, flags, output formatting area/deps Dependency updates and version bumps area/gateway Gateway, proxy, and routing area/mcp MCP protocol, MCP tool servers, integration area/rag For work/issues that have to do with the RAG features area/runtime Runtime engine, agent loop execution, tool dispatch, loop detection area/testing Test infrastructure, CI/CD, test runners, evaluation kind/refactor PR refactors code without behavior change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants