Summary
Allow session compaction to use a dedicated (smaller/faster) model for summary
generation, independent of the agent's primary model — and optionally bound the input
fed to the summariser.
Current behaviour
When a session is compacted (manual /compact, the proactive ~90% threshold trigger, or
post-overflow recovery), the summary is produced by running the compaction agent against the
session agent's own primary model:
- doCompact → compactor.RunLLM{ RunAgent: r.runCompactionAgent }
- runCompactionAgent(ctx, a, sess) uses a = resolveSessionAgent(sess)
- compactionContextLimit clones a.Model(ctx)
So the summary call is made with whatever model the agent normally uses.
Problem
The summary call has to ingest the entire conversation, so it is usually the most
expensive and slowest LLM call in a session, and it only ever runs when the context is
already large. When an agent's primary model is large, slow, or expensive (e.g. a top-tier model
with a very large context window), summary generation is correspondingly slow and costly.
Callers that wrap Summarize/compaction in a timeout are especially exposed: the one call
most likely to exceed a deadline is the summary over a big context, i.e. exactly when
compaction is needed.
There is currently no configuration knob for this — the only compaction extension points are
the pre_compact / before_compaction / after_compaction hooks. before_compaction can
supply a full summary to bypass the LLM, but that is not a practical place to implement real
summarisation.
Requests
- Dedicated compaction model. Let users configure the model used for summary generation
independently of the agent's primary model — e.g. a runtime option (WithCompactionModel)
and/or an agent/team config field. A fast, cheap model is usually more than adequate for
summarisation.
- Bounded summariser input (optional). Allow capping the number of tokens fed to the
summariser so summary latency and cost are predictable regardless of total conversation
size.
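One possible shape for the two requests, as a hedged Go sketch rather than a proposal for the actual API. Only the name WithCompactionModel comes from this issue; its signature, the Model, Runtime and Message types, WithCompactionInputLimit, summaryModel, and boundInput are all hypothetical. Dropping the oldest messages first is just one policy for bounding the input, chosen here for simplicity.

```go
package main

import "fmt"

// Model is a hypothetical stand-in for the runtime's model handle.
type Model struct{ Name string }

// Message is a hypothetical conversation entry with a precomputed token count.
type Message struct {
	Text   string
	Tokens int
}

// Runtime holds only the fields this sketch needs.
type Runtime struct {
	compactionModel       *Model // nil: fall back to the agent's primary model
	maxSummaryInputTokens int    // 0: no cap on summariser input
}

// Option is a functional option applied when the runtime is constructed.
type Option func(*Runtime)

// WithCompactionModel sets a dedicated model for summary generation.
// The name is taken from the issue; the signature is an assumption.
func WithCompactionModel(m Model) Option {
	return func(r *Runtime) { r.compactionModel = &m }
}

// WithCompactionInputLimit (hypothetical) caps the tokens fed to the summariser.
func WithCompactionInputLimit(n int) Option {
	return func(r *Runtime) { r.maxSummaryInputTokens = n }
}

// New builds a Runtime and applies the options in order.
func New(opts ...Option) *Runtime {
	r := &Runtime{}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// summaryModel picks the compaction model if configured, else the primary model,
// which preserves today's behaviour when the option is not set.
func (r *Runtime) summaryModel(primary Model) Model {
	if r.compactionModel != nil {
		return *r.compactionModel
	}
	return primary
}

// boundInput keeps the most recent messages that fit within the cap,
// preserving their order. With no cap it returns the input unchanged.
func (r *Runtime) boundInput(msgs []Message) []Message {
	limit := r.maxSummaryInputTokens
	if limit <= 0 {
		return msgs
	}
	total, start := 0, len(msgs)
	for i := len(msgs) - 1; i >= 0; i-- {
		if total+msgs[i].Tokens > limit {
			break
		}
		total += msgs[i].Tokens
		start = i
	}
	return msgs[start:]
}

func main() {
	r := New(
		WithCompactionModel(Model{Name: "small-fast"}),
		WithCompactionInputLimit(100),
	)
	msgs := []Message{{"oldest", 60}, {"middle", 50}, {"newest", 40}}
	fmt.Println("summary model:", r.summaryModel(Model{Name: "primary"}).Name)
	fmt.Println("messages kept:", len(r.boundInput(msgs)))
}
```

Leaving both options unset keeps the current behaviour (primary model, unbounded input), so the change would be opt-in under this design.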
Pointers
- pkg/runtime/session_compaction.go: doCompact, runCompactionAgent, compactionContextLimit
- pkg/runtime/compactor/compactor.go: keep/summary token budgets
Benefit
Faster, cheaper, and more reliable compaction, with summary cost decoupled from the agent's
primary model choice.