feat(compaction): allow a dedicated model for session summary generation (and bound summariser input) #3241

Description

@simonferquel-clanker

Summary

Allow session compaction to use a dedicated (smaller/faster) model for summary
generation, independent of the agent's primary model — and optionally bound the input
fed to the summariser.

Current behaviour

When a session is compacted (manual /compact, the proactive ~90% threshold trigger, or
post-overflow recovery), the summary is produced by running the compaction agent against the
session agent's own primary model:

  • doCompact → compactor.RunLLM{ RunAgent: r.runCompactionAgent }
  • runCompactionAgent(ctx, a, sess) uses a = resolveSessionAgent(sess)
  • compactionContextLimit clones a.Model(ctx)

So the summary call is made with whatever model the agent normally uses.

Problem

The summary call has to ingest the entire conversation, so it is the single most
expensive and slowest LLM call in a session — and it only ever runs when the context is
already large. When an agent's primary model is large/slow/expensive (e.g. a top-tier model
with a very large context window), summary generation is correspondingly slow and costly.
Callers that wrap Summarize/compaction in a timeout are especially exposed: the one call
most likely to exceed a deadline is the summary over a big context, i.e. exactly when
compaction is needed.

There is currently no configuration knob for this — the only compaction extension points are
the pre_compact / before_compaction / after_compaction hooks. before_compaction can
supply a full summary to bypass the LLM, but that is not a practical place to implement real
summarisation.

Requests

  1. Dedicated compaction model. Let users configure the model used for summary generation
    independently of the agent's primary model — e.g. a runtime option (WithCompactionModel)
    and/or an agent/team config field. A fast, cheap model is usually more than adequate for
    summarisation.
  2. Bounded summariser input (optional). Allow capping the number of tokens fed to the
    summariser so summary latency and cost are predictable regardless of total conversation
    size.
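
As a rough illustration of the two requests, here is a minimal Go sketch, not the project's actual API. Only the name WithCompactionModel comes from this issue; the Runtime, Model, and Message types, the WithMaxSummaryInputTokens option, and the helpers summaryModel and boundInput are all hypothetical stand-ins, and token counts are assumed to be precomputed per message.

```go
package main

import "fmt"

// Model is a hypothetical stand-in for the runtime's model handle.
type Model struct{ Name string }

// Message is a simplified conversation entry; Tokens is assumed precomputed.
type Message struct {
	Role   string
	Text   string
	Tokens int
}

// Runtime holds only the fields relevant to this sketch (hypothetical).
type Runtime struct {
	primary          Model
	compactionModel  *Model // nil means "fall back to the primary model"
	maxSummaryTokens int    // 0 means "no cap on summariser input"
}

// Option is a functional option applied by NewRuntime.
type Option func(*Runtime)

// WithCompactionModel sets a dedicated model for summary generation.
func WithCompactionModel(m Model) Option {
	return func(r *Runtime) { r.compactionModel = &m }
}

// WithMaxSummaryInputTokens caps the tokens fed to the summariser
// (hypothetical name, not taken from the issue).
func WithMaxSummaryInputTokens(n int) Option {
	return func(r *Runtime) { r.maxSummaryTokens = n }
}

// NewRuntime builds a Runtime and applies the options in order.
func NewRuntime(primary Model, opts ...Option) *Runtime {
	r := &Runtime{primary: primary}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// summaryModel returns the dedicated compaction model if one is
// configured, otherwise the agent's primary model (today's behaviour).
func (r *Runtime) summaryModel() Model {
	if r.compactionModel != nil {
		return *r.compactionModel
	}
	return r.primary
}

// boundInput keeps the longest suffix of msgs whose token total fits the
// cap. If even the newest message exceeds the cap, the result is empty.
func (r *Runtime) boundInput(msgs []Message) []Message {
	if r.maxSummaryTokens <= 0 {
		return msgs
	}
	total, start := 0, len(msgs)
	for i := len(msgs) - 1; i >= 0; i-- {
		if total+msgs[i].Tokens > r.maxSummaryTokens {
			break
		}
		total += msgs[i].Tokens
		start = i
	}
	return msgs[start:]
}

func main() {
	r := NewRuntime(
		Model{Name: "large-primary"},
		WithCompactionModel(Model{Name: "small-fast"}),
		WithMaxSummaryInputTokens(350),
	)
	msgs := []Message{
		{Role: "user", Tokens: 400},
		{Role: "assistant", Tokens: 300},
		{Role: "user", Tokens: 200},
		{Role: "assistant", Tokens: 100},
	}
	fmt.Println("summary model:", r.summaryModel().Name)
	fmt.Println("messages kept:", len(r.boundInput(msgs)))
}
```

Keeping the most recent messages is only one possible bounding policy, chosen here for brevity. An implementation could equally keep an earlier summary plus a recent tail, or truncate by other criteria.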

Pointers

  • pkg/runtime/session_compaction.go: doCompact, runCompactionAgent,
    compactionContextLimit
  • pkg/runtime/compactor/compactor.go: keep/summary token budgets

Benefit

Faster, cheaper, and more reliable compaction, with summary cost decoupled from the agent's
primary model choice.

Labels

  • area/models: LLM model integrations and model providers
  • area/runtime: Runtime engine, agent loop execution, tool dispatch, loop detection
  • area/sessions: For features/issues/fixes related to session lifecycle (resume, persistence, export)
