Skip to content

LCORE-1572: Integrate conversation compaction into the query flow#1796

Open
max-svistunov wants to merge 19 commits into
lightspeed-core:mainfrom
max-svistunov:lcore-1572-integrate-compaction-query-flow
Open

LCORE-1572: Integrate conversation compaction into the query flow#1796
max-svistunov wants to merge 19 commits into
lightspeed-core:mainfrom
max-svistunov:lcore-1572-integrate-compaction-query-flow

Conversation

@max-svistunov
Copy link
Copy Markdown
Contributor

@max-svistunov max-svistunov commented May 26, 2026

Description

Wires conversation compaction into the request flow so long conversations are summarized and continue working instead of failing with HTTP 413 (Conversation Compaction epic, LCORE-1631). Builds on the token estimator (LCORE-1569), the summarization module (LCORE-1570), and the summary cache (LCORE-1571).

⚠️ This PR reverses a design decision — please note

The conversation-handling approach changed during implementation, away from what the in-repo design docs originally specified:

  • Originally specified (spec R10 in docs/design/conversation-compaction/conversation-compaction.md, and spike Decision 6, "option B" in docs/design/conversation-compaction/conversation-compaction-spike.md): inject the summary as a marked item into the existing Llama Stack conversation, keep passing the conversation parameter, and "select only items from the last summary marker onward."
  • What this PR does instead ("option A"): build the LLM input explicitly (summaries + recent verbatim turns + new query) and stop passing the conversation parameter once a conversation is compacted.
  • Why the switch: the originally-specified approach is not implementable on the pinned llama-stack 0.6.0. When the conversation parameter is set, Llama Stack always reloads the full stored message history and exposes no marker-based selection hook — so a marker item cannot limit what the model actually sees. This was verified empirically (the experiment is recorded in the spike doc). Option A is, in fact, the spike's original recommendation, which a later spike edit had replaced with the marker approach; this reinstates it.
  • Docs updated to match this PR: conversation-compaction.mdR10, R12, the architecture flow, and a new Changelog entry. The investigation, experiment, and decision history live in conversation-compaction-spike.md.

With option A, conversation_id is preserved throughout and the full history (including pre-compaction turns) remains in the conversation's items for UI/audit.

Endpoints. Compaction is applied to every endpoint that builds context from a growing conversation and calls the Responses API:

  • /v1/query — non-streaming.
  • /v1/streaming_query — native SSE; emits a compaction event before the summarization call (R12). Only conversations that actually compact take the in-stream path, so non-compacting requests are byte-for-byte unchanged (including HTTP error handling).
  • A2A executor — inline compaction (not a browser SSE stream, so no progress event).
  • /v1/responses — OpenAI-compatible; compacts silently (no non-standard SSE event) to preserve wire compatibility.

The /v1/rlsapi inference path is stateless (no stored conversation) and is intentionally out of scope.

Key changes.

  • New src/utils/conversation_compaction.py: apply_compaction() / apply_compaction_blocking(), the needs_compaction_path() routing predicate, marker helpers, and a per-conversation lock (R11).
  • ResponsesApiParams.omit_conversation flag — drops the conversation parameter from the request body in compacted mode while keeping it on the object for identity.
  • AppConfig.compaction accessor; endpoint wiring in query.py, streaming_query.py, a2a.py, responses.py; ResponsesContext.compacted_original_input for turn storage.
  • Summary cache (LCORE-1571) as the source of truth for summaries and the home of the persisted recursive fold (R3): apply_compaction reads summaries via get_summaries, writes each new chunk via store_summary, and — when the summaries' own token count crosses the threshold — folds them via recursively_resummarize and persists the fold via replace_summaries. Falls back to the Llama Stack marker texts when no persisting cache is configured. configured_conversation_cache() resolves the cache for the endpoints; A2A runs marker-only (no user_id for the cache key).

Disabled by default (compaction.enabled: false); when off, all endpoints behave exactly as before.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (Claude Code)
  • Generated by: Claude Opus 4.7

Related Tickets & Documents

  • Related Issue # LCORE-1572
  • Closes # LCORE-1572

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  1. Unit tests: uv run python -m pytest tests/unit/utils/test_conversation_compaction.py tests/unit/app/endpoints/test_query.py tests/unit/app/endpoints/test_a2a.py tests/unit/app/endpoints/test_streaming_query.py tests/unit/app/endpoints/test_responses.py -q -> all pass. Full suite uv run python -m pytest tests/unit -> 2399 passed, 1 skipped (no regressions; the compaction-disabled default leaves all endpoints unchanged).

  2. End-to-end against a live stack (Lightspeed Stack -> Llama Stack 0.6.0 -> OpenAI gpt-4o-mini), with compaction.enabled: true and a 2000-token window so compaction triggers after a few turns:

    • /v1/query: input-token sawtooth across turns (~251 -> ~985 -> drops to ~734 when compaction fires), the secret planted in turn 0 was still recalled after it had been summarized out of the LLM context, conversation_id identical across all turns, and zero 413 errors. Log confirms utils.compaction: Summarizing N conversation items.
    • /v1/streaming_query: SSE event order was start -> compaction -> token -> turn_complete -> end (the compaction progress event is flushed before summarization), conversation_id unchanged.
    • History preservation: the Llama Stack conversation retained all items (original turns + appended turns + the summary marker), including the pre-compaction turn — full audit trail intact.
  3. Persisted recursive fold (R3), end-to-end with a SQLite conversation cache and a small (1200-token) window so summaries cross the fold threshold: across many turns the accumulated summaries were repeatedly folded (Folding N summaries (... tokens) for conversation ... in the log, followed by Recursively re-summarizing N existing summary chunks), replace_summaries collapsed them to a single row in the conversation_summaries table, and a secret planted in turn 0 was still recalled after surviving both summarization and folding. conversation_id stable throughout.

Summary by CodeRabbit

  • New Features

    • Conversation compaction: long history is summarized into explicit inputs so calls can omit full conversation while preserving recent context; compacted turns are appended back to maintain history.
    • Streaming: compacted streams emit a visible "compaction started" progress event; the standard responses path compacts silently to preserve streaming wire format.
    • Compaction-aware persistence: original user input and structured output items are preserved for compacted turns.
  • Documentation

    • Design doc revised to describe explicit-input compaction flow, endpoint applicability, and limitations.
  • Tests

    • Expanded unit tests covering compaction logic, streaming events, caching/folding, and persistence.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Warning

Review limit reached

@max-svistunov, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 30 minutes and 17 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4f8cdf34-67e7-4a02-8c3e-2ebf8f7e5111

📥 Commits

Reviewing files that changed from the base of the PR and between 5a708c4 and ce8627e.

📒 Files selected for processing (2)
  • docs/design/conversation-compaction/conversation-compaction.md
  • tests/unit/app/endpoints/test_streaming_query.py

Walkthrough

This PR implements explicit-input conversation compaction across query, responses, streaming, and A2A endpoints. The system omits the Responses API conversation parameter, constructs LLM input from marker summaries + recent items + the new query, emits streaming compaction events where applicable, and appends completed turns back into conversation history.

Changes

Conversation Compaction Feature

Layer / File(s) Summary
Design specification and configuration
docs/design/conversation-compaction/conversation-compaction.md, src/configuration.py
Design doc specifies explicit-input compaction (omit conversation, construct explicit input from markers + recent items, append completed turns); AppConfig exposes compaction configuration property.
Core compaction engine and helpers
src/utils/conversation_compaction.py
Implements compaction engine with marker detection, explicit input construction, per-conversation asyncio locks, token estimation and gating, cache read/write/folding, streaming + blocking compaction paths, CompactionStartedEvent, CompactionResult, needs_compaction_path, and store_compacted_turn.
API model extensions
src/models/common/responses/responses_api_params.py, src/models/common/responses/contexts.py, src/models/common/turn_summary.py
Adds omit_conversation to ResponsesApiParams to drop conversation parameter, compacted_original_input to ResponsesContext to preserve original user input when compaction rewrites/drops conversation, and output_items to TurnSummary to capture structured LLM outputs for compacted persistence.
Query endpoint integration
src/app/endpoints/query.py
Calls apply_compaction_blocking after prepare_responses_params, replaces params with compaction result, passes original_input into retrieve_response, and persists compacted turns via store_compacted_turn when conversation was omitted.
Responses endpoint integration
src/app/endpoints/responses.py
Applies silent compaction for eligible /v1/responses requests, records compacted_original_input in ResponsesContext, and updates blocked/previous-turn persistence to prefer the original input; _append_previous_response_turn no-ops when api_params.store is false.
Streaming query path with compaction events
src/app/endpoints/streaming_query.py
Adds needs_compaction_path preflight; generate_response_with_compaction emits SSE compaction started event before summarization, streams compaction progress, rewrites params to explicit input, converts create-time HTTPExceptions to SSE error events, and persists compacted turns post-stream using structured output_items.
A2A executor integration
src/app/endpoints/a2a.py
Applies blocking compaction after prepare_responses_params in A2A streaming path, extends stream→event conversion to accept LLM client and compacted original input, and persists compacted turns on response completion.
Unit tests
tests/unit/utils/test_conversation_compaction.py, tests/unit/app/endpoints/test_a2a.py, tests/unit/app/endpoints/test_responses.py, tests/unit/app/endpoints/test_streaming_query.py
Adds extensive tests for compaction helpers, needs_compaction_path gating, blocking/streaming compaction behavior, cache precedence/folding/persistence, endpoint invocation order, and compacted-mode persistence semantics (including structured outputs and interruption/failure cases).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lightspeed-core/lightspeed-stack#1795: Adds cache-layer APIs (store_summary, get_summaries, replace_summaries) relied upon by this PR's conversation compaction caching and folding logic.

Suggested reviewers

  • tisnik
  • are-ces
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly identifies the main change: integrating conversation compaction into the query flow, which aligns with the primary objective of wiring compaction into the request pipeline across multiple endpoints.
Docstring Coverage ✅ Passed Docstring coverage is 93.48% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/utils/conversation_compaction.py Fixed
Introduce runtime conversation compaction (Option A): once a conversation
approaches the model's context window, lightspeed-stack summarizes older
turns and owns the LLM context itself instead of letting Llama Stack reload
the full history.

- src/utils/conversation_compaction.py: apply_compaction() async generator
  and apply_compaction_blocking() wrapper. Holds a per-conversation lock
  (R11), estimates tokens (LCORE-1569), partitions and summarizes old turns
  (LCORE-1570), writes the summary into the conversation as a marker item,
  and rebuilds the request as explicit input (summaries + recent verbatim
  turns + new query). Marker items track the boundary; the conversation_id
  is preserved and the full history stays in Llama Stack items for audit.
- models/common/responses/responses_api_params.py: omit_conversation flag so
  the conversation parameter is dropped from the request body in compacted
  mode while remaining on the object for identity.
- configuration.py: AppConfig.compaction accessor.
- app/endpoints/query.py: apply compaction after preparing params; in
  compacted mode store the completed turn against the original user query
  (the conversation parameter is no longer sent, so Llama Stack does not
  persist the turn automatically).

Background: the spec's original marker-keeps-conversation-parameter approach
was found unimplementable on llama-stack 0.6.0, which always reloads the full
conversation history when the conversation parameter is set. This restores
the spike's original explicit-input approach.
Cover marker detection and boundary selection, explicit-input assembly, the
trigger threshold, the disabled / no-context-window / existing-marker /
triggered paths of apply_compaction, the streaming CompactionStartedEvent
ordering, and compacted-turn storage.
The A2A executor uses the same prepare_responses_params + Responses API
flow as /v1/query and persists conversation_id for multi-turn contexts, so
it accumulates context and must compact too.

- Run apply_compaction_blocking before responses.create (A2A is not a
  browser SSE stream, so no progress event is emitted).
- In compacted mode, persist the completed turn from the response.completed
  stream event, since the conversation parameter is no longer sent and Llama
  Stack therefore does not store the turn automatically.
…oint

Stream /v1/streaming_query through the compaction-aware path only when the
conversation actually compacts, so non-compacting requests are unaffected
(byte-for-byte the existing flow, including HTTP error handling).

- conversation_compaction: add needs_compaction_path(), a cheap pre-stream
  predicate (no LLM, no lock) that is true only when the conversation already
  has a summary marker or would trigger a new compaction.
- streaming_query: when the predicate is true, stream via the new
  generate_response_with_compaction(), which emits the compaction progress
  event before the summarization LLM call (R12) and creates the response
  inside the stream, surfacing create-time errors as SSE error events.
  generate_response gains emit_start/compacted parameters and, in compacted
  mode, appends the completed turn to the conversation (the conversation
  parameter is not sent, so Llama Stack does not store it automatically).
- a2a: silence too-many-lines after the earlier compaction wiring.
Cover needs_compaction_path: disabled, existing-marker, over-threshold, and
under-threshold — the gate that keeps non-compacting requests on the
unchanged streaming path.
/v1/responses is the OpenAI-compatible Responses API, so compaction is
silent: no custom SSE event is injected (preserving wire compatibility) and
create-time error handling is unchanged. Summarization runs before the
response is created, on both the streaming and non-streaming paths.

- responses_endpoint_handler: run apply_compaction_blocking before the
  streaming/non-streaming split, gated to stateful single-conversation
  requests (store=True, a conversation present, no previous_response_id).
- ResponsesContext: carry compacted_original_input so the finalization can
  store the turn against the original user input.
- _append_previous_response_turn: generalized to also append the turn in
  compacted mode (the conversation parameter is dropped, so Llama Stack does
  not store the turn automatically) using the original input.
Verify _append_previous_response_turn stores the turn against the original
input in compacted mode, and stores nothing when store is disabled.
Revise R10, R12, the architecture flow, the changed-request-flow section, and
the implementation guidance to match what was built: in compacted mode
lightspeed-stack builds explicit input and omits the Llama Stack conversation
parameter (which always reloads full history), preserving conversation_id and
the full item history. Record the redesign and the four affected endpoints
(query, streaming_query, A2A, /v1/responses) in a new Changelog section.
… warning)

The explicit compacted input was assembled as plain dicts, which produced
PydanticSerializationUnexpectedValue warnings when ResponsesApiParams was
dumped (its input field is typed ResponseInput). Build the summary, recent
verbatim, and query items as typed OpenAIResponseMessage objects instead.

Verified end-to-end against a live stack: the serializer warning is gone and
compaction still triggers, preserves conversation identity, and recalls
earlier context correctly.
apply_compaction_blocking asserted that the generator yielded a result. Under
python -O asserts are stripped, so the guard would vanish and a None result
could propagate to callers. Replace it with an explicit None check that raises
RuntimeError.

Clears a GitHub code-scanning (CodeQL) "use of assert" finding. The repository's
Bandit configuration skips B101, so this only surfaced via code scanning, not
the Bandit CI job.
Make the conversation summary cache the preferred source of truth for
compaction summaries and the home of the persisted recursive fold.

- apply_compaction / apply_compaction_blocking gain cache + user_id +
  skip_user_id_check. Summaries are read from the cache (get_summaries) and each
  new chunk is written to it (store_summary); the Llama Stack marker texts remain
  an authoritative fallback when no persisting cache is configured (marker-only
  mode, additive summaries, no fold).
- When the persisted summaries themselves exceed the threshold, they are folded
  via recursively_resummarize and the fold is persisted with replace_summaries,
  so it is computed once and reused rather than recomputed per request (R3).
- configured_conversation_cache() resolves the configured cache (or None) for
  the endpoints.
- Wired into /v1/query, /v1/streaming_query, and /v1/responses. The A2A executor
  stays marker-only: it has no resolved user_id for the (user_id, conversation_id)
  cache key.

Adds 7 unit tests: cache-preferred reads, store-on-compaction, fold trigger and
persistence, no-fold-without-cache, marker fallback, and the cache resolver.
@max-svistunov max-svistunov force-pushed the lcore-1572-integrate-compaction-query-flow branch from 3473bb9 to 79dd34d Compare May 26, 2026 12:58
@max-svistunov
Copy link
Copy Markdown
Contributor Author

@tisnik Could you PTAL? Three design decisions worth noting:

  1. We prefer cache, and fall back to marker. Summaries are read from the LCORE-1571 cache when a persisting backend (SQLite or PostgreSQL) is configured, which enables the recursive fold (R3); otherwise the runtime falls back to the authoritative Llama Stack markers. This avoids regressing deployments with a non-persisting or absent cache, where a cache-only approach would fail.
  2. A2A operates in marker-only mode. It has no resolved user_id for the cache key, so it passes no cache and does not fold.
  3. The fold reuses threshold_ratio, same as the non-recursive summarization. It triggers when the summaries' token count exceeds context_window × threshold_ratio, rather than adding a separate parameter.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/app/endpoints/responses.py (1)

485-518: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Carry compacted_original_input for every explicit-input request, including blocked ones.

apply_compaction_blocking() can rewrite the request when reusing an existing summary marker, not just when it creates a fresh one. Because Lines 502-503 only set compacted_original_input on compaction.summarized, _append_previous_response_turn() misses marker-reuse requests, and shield-blocked requests still persist api_params.input — which is the synthetic summaries-plus-recent payload after compaction. Populate this whenever the request omits conversation, and have blocked-turn persistence prefer it too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/responses.py` around lines 485 - 518, The code only sets
compacted_original_input when compaction.summarized is true, so requests that
reused an existing summary marker (or were shield-blocked) keep api_params.input
(the synthetic summary+recent payload); update the post-compaction assignment in
the ResponsesContext construction to set compacted_original_input whenever
apply_compaction_blocking() returns an original_input (i.e., if
compaction.original_input is present) regardless of compaction.summarized, and
update _append_previous_response_turn() so blocked-turn persistence prefers
ResponsesContext.compacted_original_input over api_params.input when the request
omits conversation; this ensures marker-reuse and blocked requests carry the
true original_input.
src/app/endpoints/query.py (1)

205-226: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Refresh Azure credentials before running compaction.

apply_compaction_blocking() can make its own Llama Stack summarization call. For Azure-backed models, the refresh now happens only afterward, so an expired token will fail compaction even though the normal response path would have refreshed it. Move the Azure refresh block above Lines 208-217, or refresh inside the compaction path as well.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/query.py` around lines 205 - 226, The Azure token refresh
happens after apply_compaction_blocking, which can itself call Llama Stack and
will fail with expired Azure tokens; move the Azure refresh check (use
AzureEntraIDManager().is_entra_id_configured,
AzureEntraIDManager().is_token_expired and
AzureEntraIDManager().refresh_token()) to run before calling
apply_compaction_blocking (or alternatively ensure apply_compaction_blocking
refreshes via the same AzureEntraIDManager path), and if refresh succeeds call
AsyncLlamaStackClientHolder().update_azure_token() to replace client before
invoking apply_compaction_blocking (references: apply_compaction_blocking,
AzureEntraIDManager, AsyncLlamaStackClientHolder, responses_params.model).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/app/endpoints/a2a.py`:
- Around line 418-425: The current call to _convert_stream_to_events passes
compaction.original_input only when compaction.summarized is true, so
reused-summary (explicit-input/omit-conversation) flows that have
compaction.original_input but compaction.summarized==False don't get the
compacted turn persisted; change the gating to use the presence of
compaction.original_input (not compaction.summarized). Update the async for call
and any subsequent persistence logic to pass and check compaction.original_input
(e.g., pass compaction.original_input directly to _convert_stream_to_events and
persist the completed A2A turn whenever compaction.original_input is non-null)
so explicit-input reuse flows store the compacted original turn.

In `@src/app/endpoints/query.py`:
- Around line 229-235: The code currently passes original_input to
retrieve_response based on compaction.summarized, which misses cases where a
cached summary was reused but conversation was omitted; change the logic to key
original_input (and thus enable store_compacted_turn()) off the compaction's
"explicit-input / conversation omitted" flag instead of compaction.summarized:
pass original_input=compaction.original_input if compaction indicates
conversation was omitted (e.g. compaction.explicit_input_mode or
compaction.omits_conversation) else None, and ensure downstream logic in
retrieve_response/store_compacted_turn() relies on that explicit-input marker
rather than compaction.summarized.

In `@src/app/endpoints/streaming_query.py`:
- Around line 820-835: The compacted-path currently calls
append_turn_to_conversation with only query_request.query and
turn_summary.llm_response, which flattens outputs and drops attachments/non-text
items and can duplicate shield refusals; replace this by constructing and
persisting the same structured compacted-turn object that the non-compacted path
(the helper used by retrieve_response_generator) produces, and pass that
structured turn into append_turn_to_conversation so it preserves attachments,
non-text outputs and shield-block metadata (use the same helper the
retrieve_response_generator path uses rather than sending plain strings from
context.query_request.query and turn_summary.llm_response).
- Around line 637-699: The compacted flag must be driven by whether the request
actually entered explicit-input mode rather than the CompactionResult.summarized
value; change the logic around compacted (currently initialized before
apply_compaction and overwritten from item.summarized) to derive compacted from
the incoming request parameters (responses_params) that indicate explicit-input
mode (e.g. responses_params.explicit_input or
responses_params.explicit_input_mode) and only fall back to item.summarized when
that request-level flag is absent; update the block that handles
CompactionResult from apply_compaction and the value passed into
generate_response(generator, ..., compacted=...) accordingly so explicit-input
requests always set compacted=True.

In `@src/models/common/responses/contexts.py`:
- Around line 59-66: Replace the legacy typing Optional[...] with modern union
syntax for the field compacted_original_input: change its annotation from
Optional[ResponseInput] to ResponseInput | None in the declaration of
compacted_original_input (the Field(...) definition referencing ResponseInput),
ensuring the type hint uses the pipe form everywhere this field is defined.

In `@src/utils/conversation_compaction.py`:
- Around line 147-151: is_marker_item currently treats any message whose text
starts with MARKER_SENTINEL as a compaction marker, allowing user content to
spoof markers and cause premature context trimming; update is_marker_item (and
the boundary-check logic that relies on it) to only accept markers that are
tamper-resistant by checking a dedicated, non-user-writable attribute or
structured field on the item (e.g., a metadata flag like
item.get("compaction_marker") or a system-only role), and also require the
sentinel to be the entire marker token (not just a prefix) and/or validate a
marker version/signature if available; ensure you keep the is_message_item check
and reference MARKER_SENTINEL, but only return True when the item has the
protected marker attribute and the exact sentinel (or validated signature) is
present so user messages cannot forge markers.
- Around line 397-399: The compaction routine only calls estimate_tokens when
original_input is a str, so list-form ResponseInput items are skipped; update
the logic around original_input (used with estimate_tokens and encoding_name
that updates estimated) to handle list and dict forms: if original_input is a
list, iterate and sum estimate_tokens for each element (recursively handle
elements that are str or dict with text/content fields), and if it's a dict
(ResponseInput), extract the relevant text field(s) and estimate those tokens;
apply the same fix in the second occurrence that updates estimated (the block
mirrored around the other estimate_tokens call).

---

Outside diff comments:
In `@src/app/endpoints/query.py`:
- Around line 205-226: The Azure token refresh happens after
apply_compaction_blocking, which can itself call Llama Stack and will fail with
expired Azure tokens; move the Azure refresh check (use
AzureEntraIDManager().is_entra_id_configured,
AzureEntraIDManager().is_token_expired and
AzureEntraIDManager().refresh_token()) to run before calling
apply_compaction_blocking (or alternatively ensure apply_compaction_blocking
refreshes via the same AzureEntraIDManager path), and if refresh succeeds call
AsyncLlamaStackClientHolder().update_azure_token() to replace client before
invoking apply_compaction_blocking (references: apply_compaction_blocking,
AzureEntraIDManager, AsyncLlamaStackClientHolder, responses_params.model).

In `@src/app/endpoints/responses.py`:
- Around line 485-518: The code only sets compacted_original_input when
compaction.summarized is true, so requests that reused an existing summary
marker (or were shield-blocked) keep api_params.input (the synthetic
summary+recent payload); update the post-compaction assignment in the
ResponsesContext construction to set compacted_original_input whenever
apply_compaction_blocking() returns an original_input (i.e., if
compaction.original_input is present) regardless of compaction.summarized, and
update _append_previous_response_turn() so blocked-turn persistence prefers
ResponsesContext.compacted_original_input over api_params.input when the request
omits conversation; this ensures marker-reuse and blocked requests carry the
true original_input.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a2e60d83-a115-4796-bfe8-53dec5894ee7

📥 Commits

Reviewing files that changed from the base of the PR and between 6c0733b and 79dd34d.

📒 Files selected for processing (12)
  • docs/design/conversation-compaction/conversation-compaction.md
  • src/app/endpoints/a2a.py
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/configuration.py
  • src/models/common/responses/contexts.py
  • src/models/common/responses/responses_api_params.py
  • src/utils/conversation_compaction.py
  • tests/unit/app/endpoints/test_a2a.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/utils/test_conversation_compaction.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 2
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/models/common/responses/responses_api_params.py
  • src/configuration.py
  • src/models/common/responses/contexts.py
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/a2a.py
  • src/utils/conversation_compaction.py
  • src/app/endpoints/streaming_query.py
src/models/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models must use @model_validator and @field_validator for validation and complete type annotations for all attributes, avoiding Any type

Files:

  • src/models/common/responses/responses_api_params.py
  • src/models/common/responses/contexts.py
src/**/configuration.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/configuration.py: All config models must extend ConfigurationBase with extra="forbid" to reject unknown fields
Use @field_validator and @model_validator for custom validation in Pydantic models

Files:

  • src/configuration.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/app/endpoints/test_a2a.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/utils/test_conversation_compaction.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/a2a.py
  • src/app/endpoints/streaming_query.py
🧠 Learnings (4)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/common/responses/responses_api_params.py
  • src/models/common/responses/contexts.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/common/responses/responses_api_params.py
  • src/models/common/responses/contexts.py
📚 Learning: 2026-01-14T09:37:51.612Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.

Applied to files:

  • src/app/endpoints/query.py
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/a2a.py
  • src/app/endpoints/streaming_query.py
🔇 Additional comments (6)
src/models/common/responses/responses_api_params.py (1)

124-146: LGTM!

src/configuration.py (1)

21-21: LGTM!

Also applies to: 323-336

tests/unit/utils/test_conversation_compaction.py (1)

1-517: LGTM!

tests/unit/app/endpoints/test_a2a.py (1)

804-869: LGTM!

tests/unit/app/endpoints/test_responses.py (2)

23-23: LGTM!


2789-2836: LGTM!

Comment thread src/app/endpoints/a2a.py
Comment thread src/app/endpoints/query.py
Comment thread src/app/endpoints/streaming_query.py
Comment thread src/app/endpoints/streaming_query.py
Comment on lines +59 to +66
compacted_original_input: Optional[ResponseInput] = Field(
default=None,
description="Set only when conversation compaction (LCORE-1572) rewrote "
"the request: the original user input before the explicit-input "
"rewrite. When present, the completed turn is appended to the "
"conversation using this input, since the conversation parameter was "
"dropped and Llama Stack therefore does not store the turn.",
)
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use modern union syntax for type annotations.

The field uses Optional[ResponseInput] instead of the modern ResponseInput | None syntax. As per coding guidelines, type annotations should use modern syntax like str | int rather than Optional or Union.

♻️ Proposed fix to use modern syntax
-    compacted_original_input: Optional[ResponseInput] = Field(
+    compacted_original_input: ResponseInput | None = Field(
         default=None,
         description="Set only when conversation compaction (LCORE-1572) rewrote "
         "the request: the original user input before the explicit-input "
         "rewrite. When present, the completed turn is appended to the "
         "conversation using this input, since the conversation parameter was "
         "dropped and Llama Stack therefore does not store the turn.",
     )

As per coding guidelines, all functions must have complete type annotations using modern syntax (str | int).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/models/common/responses/contexts.py` around lines 59 - 66, Replace the
legacy typing Optional[...] with modern union syntax for the field
compacted_original_input: change its annotation from Optional[ResponseInput] to
ResponseInput | None in the declaration of compacted_original_input (the
Field(...) definition referencing ResponseInput), ensuring the type hint uses
the pipe form everywhere this field is defined.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declining this one. This repository's coding guidelines (CLAUDE.md) specify Optional[Type] for optionals, and the codebase uses Optional[...] consistently (including throughout conversation_compaction.py). Changing this single field to ResponseInput | None would be inconsistent with the surrounding style.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Resolve the CodeRabbit comment r3303871929 on PR `#1796`
gh api \
  --method PUT \
  -H "Accept: application/vnd.github+json" \
  /repos/lightspeed-core/lightspeed-stack/pulls/1796/comments/r3303871929 \
  2>/dev/null || true

# List my review comments on the PR to find the right one
gh api \
  -H "Accept: application/vnd.github+json" \
  "/repos/lightspeed-core/lightspeed-stack/pulls/1796/comments" \
  --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, path: .path, line: .line, body: (.body | .[0:80])}'

Length of output: 2166


🏁 Script executed:

#!/bin/bash
# Try to resolve the pull request review thread via GraphQL
# First, find the thread ID for comment 3303871929
gh api graphql -f query='
{
  repository(owner: "lightspeed-core", name: "lightspeed-stack") {
    pullRequest(number: 1796) {
      reviewThreads(first: 100) {
        nodes {
          id
          isResolved
          comments(first: 1) {
            nodes {
              databaseId
              body
            }
          }
        }
      }
    }
  }
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.comments.nodes[0].databaseId == 3303871929) | {threadId: .id, isResolved: .isResolved}'

Length of output: 741


🏁 Script executed:

#!/bin/bash
# Resolve the review thread
gh api graphql -f query='
mutation {
  resolveReviewThread(input: {threadId: "PRRT_kwDOOiwXDs6Ey9Uc"}) {
    thread {
      id
      isResolved
    }
  }
}'

Length of output: 705



@max-svistunov, acknowledged and noted — the Optional[Type] style will be respected in future reviews for this repository. I wasn't able to auto-resolve the thread due to integration permissions, so feel free to resolve it manually when convenient.


🧠 Learnings used
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Comment on lines +147 to +151
def is_marker_item(item: Any) -> bool:
"""Return True when *item* is a compaction summary marker message."""
if not is_message_item(item):
return False
return extract_message_text(item).startswith(MARKER_SENTINEL)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make marker identification tamper-resistant.

Line 147 currently treats any message starting with MARKER_SENTINEL as a compaction marker. A user can forge that prefix in normal input, and then Line 159 will shift the boundary after the forged message, dropping prior context from explicit-input assembly.

Also applies to: 159-169

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/conversation_compaction.py` around lines 147 - 151, is_marker_item
currently treats any message whose text starts with MARKER_SENTINEL as a
compaction marker, allowing user content to spoof markers and cause premature
context trimming; update is_marker_item (and the boundary-check logic that
relies on it) to only accept markers that are tamper-resistant by checking a
dedicated, non-user-writable attribute or structured field on the item (e.g., a
metadata flag like item.get("compaction_marker") or a system-only role), and
also require the sentinel to be the entire marker token (not just a prefix)
and/or validate a marker version/signature if available; ensure you keep the
is_message_item check and reference MARKER_SENTINEL, but only return True when
the item has the protected marker attribute and the exact sentinel (or validated
signature) is present so user messages cannot forge markers.

Comment thread src/utils/conversation_compaction.py Outdated
…ity rename

- Count tokens for list-form ResponseInput (e.g. /v1/responses), not only the
  string form, so compaction is not skipped on large item-list inputs that
  could otherwise still hit HTTP 413. Adds _estimate_response_input_tokens and a
  regression test.
- Rename CompactionResult.summarized to compacted: the flag means "served in
  compacted / explicit-input mode" (set whenever the conversation has any
  summary, reused or fresh), not "a summary was created this request". The old
  name caused reviewer confusion about turn-persistence gating, which is correct
  as written.
…bit lightspeed-core#4)

In compacted mode the streaming endpoint persisted the completed turn as
flattened strings via append_turn_to_conversation, dropping attachments and
non-text output items, and double-storing for shield-blocked requests. Persist
the structured turn instead:

- Capture the response's structured output items onto TurnSummary.output_items
  (set at response.completed, and to the refusal item on a shield block).
- generate_response now takes original_input and persists via store_compacted_turn
  with the original input plus structured output items, matching the /v1/query
  and A2A paths.
- The shield-blocked branch no longer stores the turn when the conversation
  parameter was omitted (compacted mode); generate_response stores it once with
  the correct original input, avoiding the duplicate refusal turn.

Adds tests for the structured compacted persistence and the shield dedup
(compacted and non-compacted).
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/app/endpoints/responses.py`:
- Around line 502-503: Blocked compacted responses are persisting the synthetic
compacted prompt (api_params.input) instead of the original user input; update
_persist_blocked_response_turn (and any call sites that append blocked turns) to
use compaction.compacted ? compaction.original_input (the
compacted_original_input variable set when compaction.compacted is true) as the
previous-response input instead of api_params.input so that
_append_previous_response_turn receives the original user input for blocked
compacted turns; locate references to compacted_original_input,
api_params.input, _persist_blocked_response_turn, and
_append_previous_response_turn and branch to prefer compacted_original_input
when present.

In `@src/app/endpoints/streaming_query.py`:
- Around line 839-847: The interrupted compacted-path currently ignores
original_input and uses responses_params.input (which can be a non-string
rewritten prompt), causing wrong turn storage and a problematic cast; update
_persist_interrupted_turn to thread original_input through to
store_compacted_turn (same way as the completed path does), use original_input
when present instead of responses_params.input, and avoid calling cast(str,
responses_params.input) for compacted responses; refer to functions/variables
_persist_interrupted_turn, store_compacted_turn, original_input,
responses_params.input, and responses_params.conversation when making the
change.
- Around line 1017-1020: generate_response() only populates
turn_summary.output_items when the stream ended with response.completed, which
causes compacted-mode persistence to lose partial artifacts for
response.failed/response.incomplete; update the logic that sets
turn_summary.output_items = list(latest_response_object.output or []) so it runs
for any terminal response (e.g., when response.status in {completed, failed,
incomplete} or a generic response.terminal/terminal flag) rather than only on
response.completed, ensuring failed/incomplete terminals preserve their
output_items for compacted persistence.

In `@src/models/common/turn_summary.py`:
- Around line 111-115: The output_items field currently typed as list[Any]
disables Pydantic validation; change it to a concrete list of a new Pydantic
model (e.g., OutputItem) that describes the persisted item shape, update the
Field type to list[OutputItem], and add validation using
`@field_validator`("output_items", mode="before"/"after") to coerce/validate each
list element and a `@model_validator` for the TurnSummary model to enforce
invariants before persistence; ensure the new OutputItem model defines all
required fields with types and serialization-friendly types so invalid or
non-serializable payloads are rejected at the field boundary.

In `@src/utils/conversation_compaction.py`:
- Around line 129-146: The short-circuit that sets compacted=False when
compaction is globally disabled should only disable creating new summaries, not
reuse of existing summaries/markers: change the early return on the `enabled`
flag so that it does not force `compacted=False` unconditionally; instead, when
`enabled` is False, still inspect the conversation for existing summary markers
or cached summaries and set `compacted = True` if any are present (so the code
path that omits `conversation` and uses `params` explicit input is preserved),
but skip any logic that would create new summaries; update the same logic
referenced around lines 380-382 accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e66085a9-88c3-43b3-8d6e-656d16eb7c7d

📥 Commits

Reviewing files that changed from the base of the PR and between 79dd34d and 423e093.

📒 Files selected for processing (8)
  • src/app/endpoints/a2a.py
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/models/common/turn_summary.py
  • src/utils/conversation_compaction.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/utils/test_conversation_compaction.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/app/endpoints/a2a.py
  • src/app/endpoints/query.py
  • src/models/common/turn_summary.py
  • src/utils/conversation_compaction.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

  • src/app/endpoints/a2a.py
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
src/models/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models must use @model_validator and @field_validator for validation and complete type annotations for all attributes, avoiding Any type

Files:

  • src/models/common/turn_summary.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/utils/test_conversation_compaction.py
🧠 Learnings (4)
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/a2a.py
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
📚 Learning: 2026-01-14T09:37:51.612Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.

Applied to files:

  • src/app/endpoints/query.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/common/turn_summary.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/common/turn_summary.py
🔇 Additional comments (6)
tests/unit/app/endpoints/test_streaming_query.py (1)

874-875: LGTM!

Also applies to: 891-906, 907-950, 1215-1279

tests/unit/utils/test_conversation_compaction.py (1)

6-6: LGTM!

Also applies to: 133-133, 152-152, 183-183, 220-220, 257-257, 476-476, 520-533

src/utils/conversation_compaction.py (1)

313-322: LGTM!

Also applies to: 405-412, 562-565

src/app/endpoints/query.py (1)

229-235: LGTM!

src/app/endpoints/streaming_query.py (1)

79-88: LGTM!

Also applies to: 374-384, 645-708, 720-743

src/app/endpoints/a2a.py (1)

418-425: LGTM!

Comment thread src/app/endpoints/responses.py
Comment thread src/app/endpoints/streaming_query.py
Comment thread src/app/endpoints/streaming_query.py
Comment thread src/models/common/turn_summary.py Outdated
Comment on lines +129 to +146
``compacted`` is True, ``params.input`` is an explicit item list
(summaries + recent turns + new query) and the ``conversation``
parameter is omitted from the request body.
compacted: Whether the request is served in compacted (explicit-input)
mode — i.e. the conversation has at least one summary (from a marker
or the cache), so the ``conversation`` parameter is omitted. This is
True whether the summary was created this request or reused from a
prior one. Drives ``context_status``.
original_input: The new user query exactly as it arrived (before the
explicit-input rewrite). Populated only in compacted mode (where
``compacted`` is True); ``None`` otherwise. In compacted mode the
caller must append this plus the LLM output to the conversation
items itself, since the ``conversation`` parameter is no longer
passed to Llama Stack.
"""

params: ResponsesApiParams
compacted: bool
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve explicit-input mode for conversations that were already compacted.

compacted is now documented as "this conversation already has summaries, so omit conversation", but the new enabled short-circuit returns compacted=False before you inspect markers or cached summaries. If ops disables compaction after a conversation was compacted once, the next request falls back to full-history replay, which can reintroduce the 413 path and send marker text back through the model context. Gate only new summarization on enabled; reuse of existing summaries still needs the compacted path.

Also applies to: 380-382

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/conversation_compaction.py` around lines 129 - 146, The
short-circuit that sets compacted=False when compaction is globally disabled
should only disable creating new summaries, not reuse of existing
summaries/markers: change the early return on the `enabled` flag so that it does
not force `compacted=False` unconditionally; instead, when `enabled` is False,
still inspect the conversation for existing summary markers or cached summaries
and set `compacted = True` if any are present (so the code path that omits
`conversation` and uses `params` explicit input is preserved), but skip any
logic that would create new summaries; update the same logic referenced around
lines 380-382 accordingly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declining by design. enabled: false is a full off-switch and the disabled path is a zero-overhead no-op; honoring existing summaries when disabled would require fetching conversation items on every disabled request, breaking the documented "when off, behaves exactly as before" contract. Disabling mid-conversation on an already-compacted conversation is an unsupported transition — now documented in a code comment (apply_compaction) and in the spec doc's Configuration section + Changelog (5a708c4, 597ab01).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@max-svistunov, understood — the zero-overhead contract for the disabled path is a valid design constraint, and documenting the unsupported mid-conversation disable transition in both the code and the spec is the right mitigation. I'll withdraw this suggestion.

(゚ ε ゚ )


✏️ Learnings added
Learnt from: max-svistunov
Repo: lightspeed-core/lightspeed-stack PR: 1796
File: src/utils/conversation_compaction.py:129-146
Timestamp: 2026-05-27T08:23:03.962Z
Learning: In `src/utils/conversation_compaction.py` (lightspeed-stack), `compaction.enabled: false` is intentionally a full off-switch and a zero-overhead no-op. It does **not** attempt to honor existing summary markers or cached summaries when disabled, because doing so would require fetching conversation items on every disabled request, breaking the "when off, behaves exactly as before" contract. Disabling compaction mid-conversation on an already-compacted conversation is an explicitly unsupported transition; the limitation is documented in the `apply_compaction` code comment and in the spec doc's Configuration section + Changelog.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

…is disabled

configured_conversation_cache() is evaluated eagerly as a call argument in the
query endpoint, so it ran on every request and accessed
configuration.conversation_cache unconditionally — forcing the (SQLite) cache to
initialize even when compaction is disabled. On configurations whose cache file
could not be opened that raised and returned HTTP 500, which failed the e2e
suites (where compaction is off). Return None without touching the cache when
compaction is disabled; the cache is only used by compaction on this path.

Adds a regression test.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/utils/test_conversation_compaction.py`:
- Around line 522-535: The test currently only asserts the return value and
doesn't fail if configured_conversation_cache() eagerly reads
configuration.conversation_cache; update the mock so any access to
configuration.conversation_cache raises—e.g., attach a PropertyMock
(unittest.mock.PropertyMock) or set a side_effect on
mock_config.conversation_cache that raises RuntimeError, then call
cc.configured_conversation_cache() and assert it returns None; this ensures
configured_conversation_cache() does not touch configuration.conversation_cache
(referencing configured_conversation_cache, mock_config, and
configuration.conversation_cache to locate the code to change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3b4a3722-e903-4104-b0c5-dfac4b309d76

📥 Commits

Reviewing files that changed from the base of the PR and between 423e093 and a82f9ff.

📒 Files selected for processing (2)
  • src/utils/conversation_compaction.py
  • tests/unit/utils/test_conversation_compaction.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/utils/test_conversation_compaction.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/utils/conversation_compaction.py
🔇 Additional comments (2)
src/utils/conversation_compaction.py (1)

302-315: LGTM!

tests/unit/utils/test_conversation_compaction.py (1)

504-519: LGTM!

Also applies to: 538-551

Comment on lines +522 to +535
def test_configured_conversation_cache_none_when_compaction_disabled(
mocker: MockerFixture,
) -> None:
"""Returns None when compaction is disabled, even if a cache is configured.

Regression guard (LCORE-1572): the cache must not be initialized on every
request when compaction is off — doing so 500'd e2e requests on configs whose
SQLite cache could not be opened.
"""
mock_config = mocker.patch.object(cc, "configuration")
mock_config.compaction.enabled = False
mock_config.conversation_cache_configuration.type = "sqlite"
mock_config.conversation_cache = object()
assert cc.configured_conversation_cache() is None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the regression test fail on eager cache access.

This only checks the return value. Since configuration is a mock, configured_conversation_cache() can still read configuration.conversation_cache and the test will keep passing, so it does not actually guard the reported regression.

🔍 Tighten the regression guard
+from types import SimpleNamespace
 from typing import Any, cast
@@
 def test_configured_conversation_cache_none_when_compaction_disabled(
     mocker: MockerFixture,
 ) -> None:
@@
-    mock_config = mocker.patch.object(cc, "configuration")
-    mock_config.compaction.enabled = False
-    mock_config.conversation_cache_configuration.type = "sqlite"
-    mock_config.conversation_cache = object()
+    class _Config:
+        compaction = SimpleNamespace(enabled=False)
+        conversation_cache_configuration = SimpleNamespace(type="sqlite")
+
+        `@property`
+        def conversation_cache(self) -> object:
+            raise AssertionError("conversation_cache should not be accessed")
+
+    mocker.patch.object(cc, "configuration", _Config())
     assert cc.configured_conversation_cache() is None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/utils/test_conversation_compaction.py` around lines 522 - 535, The
test currently only asserts the return value and doesn't fail if
configured_conversation_cache() eagerly reads configuration.conversation_cache;
update the mock so any access to configuration.conversation_cache raises—e.g.,
attach a PropertyMock (unittest.mock.PropertyMock) or set a side_effect on
mock_config.conversation_cache that raises RuntimeError, then call
cc.configured_conversation_cache() and assert it returns None; this ensures
configured_conversation_cache() does not touch configuration.conversation_cache
(referencing configured_conversation_cache, mock_config, and
configuration.conversation_cache to locate the code to change).

…ges)

Follow-ups to the streaming-persistence work, all for non-happy-path terminals in
compacted mode (conversation parameter omitted), so the persisted turn uses the
original user input + structured output rather than the explicit rewrite or
flattened strings:

- /v1/responses: shield-blocked turns persist against compacted_original_input,
  not api_params.input (the explicit rewrite).
- streaming: interrupted (CancelledError) turns thread original_input through the
  interrupt callback and persist structured items, fixing the wrong-input storage
  and the cast(str, input) break on list inputs.
- streaming: capture output_items on response.failed / response.incomplete
  terminals too, not only response.completed, so compacted persistence keeps
  partial output.
- TurnSummary.output_items typed as list[OpenAIResponseOutput] instead of list[Any].

Also documents that disabling compaction mid-conversation on an already-compacted
conversation reverts it to full-history replay (unsupported transition); the
enabled flag stays a full off-switch (CodeRabbit E, declined by design).

Adds unit tests for the blocked /responses path, the interrupted compacted path,
and output_items capture on a failed terminal.
…f-truth, persisted fold)

The spec still described the earlier design (cache as a parallel/best-effort
layer, markers as the summary source). Update Summary storage, Additive
summarization, and Changed request flow to the as-built design, and add a
Changelog entry: the cache is the preferred source of truth for summaries (marker
texts as fallback + audit/boundary), the recursive fold is persisted via
replace_summaries (in-memory fold rejected), A2A is marker-only, and the
enabled flag stays a full off-switch.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/app/endpoints/streaming_query.py (1)

660-680: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor MEDIA_TYPE_TEXT in the compaction wrapper.

generate_response_with_compaction() now bypasses the endpoint's text-mode formatters and unconditionally emits SSE JSON for the compaction-progress event and the HTTPException path. A text/plain client will therefore receive raw data: {...} frames whenever compaction runs or create-time validation fails.

🩹 Minimal fix sketch
             if isinstance(item, CompactionStartedEvent):
-                yield stream_compaction_event(context.conversation_id)
+                if media_type != MEDIA_TYPE_TEXT:
+                    yield stream_compaction_event(context.conversation_id)
...
-    except HTTPException as e:
-        yield _http_exception_stream_event(e)
+    except HTTPException as e:
+        yield _http_exception_stream_event(e, media_type)
-def _http_exception_stream_event(exc: HTTPException) -> str:
+def _http_exception_stream_event(
+    exc: HTTPException, media_type: str = MEDIA_TYPE_JSON
+) -> str:
     detail = (
         exc.detail if isinstance(exc.detail, dict) else {"response": str(exc.detail)}
     )
+    if media_type == MEDIA_TYPE_TEXT:
+        return (
+            f"Status: {exc.status_code} - "
+            f"{detail.get('response', '')} - {detail.get('cause')}"
+        )
     return format_stream_data(
         {"event": "error", "data": {"status_code": exc.status_code, **detail}}
     )

Also applies to: 691-693

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/streaming_query.py` around lines 660 - 680, The compaction
wrapper in generate_response_with_compaction currently emits SSE JSON (via
stream_compaction_event and the error/HTTPException paths) regardless of the
request media_type, causing text/plain clients to receive raw JSON SSE frames;
update generate_response_with_compaction (the loop using apply_compaction and
the exception handling) to honor MEDIA_TYPE_TEXT vs MEDIA_TYPE_JSON by branching
on context.query_request.media_type (or media_type variable) and emitting the
appropriate formatter for compaction progress and error events (e.g., call a
text-mode equivalent of stream_compaction_event or adapt stream_compaction_event
to accept media_type), and ensure the HTTPException path similarly uses the text
formatter when media_type == MEDIA_TYPE_TEXT so text clients get plain-text
frames instead of JSON SSE frames.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Line 2978: The long module/test docstring "Interrupted compacted turn persists
the original input, not the explicit rewrite (LCORE-1572)." exceeds pylint's
100-character limit; replace it with a wrapped multi-line docstring under 100
chars per line by splitting the sentence across lines (e.g., two lines) so each
line is <=100 characters and preserve the same text and ticket reference.

---

Outside diff comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 660-680: The compaction wrapper in
generate_response_with_compaction currently emits SSE JSON (via
stream_compaction_event and the error/HTTPException paths) regardless of the
request media_type, causing text/plain clients to receive raw JSON SSE frames;
update generate_response_with_compaction (the loop using apply_compaction and
the exception handling) to honor MEDIA_TYPE_TEXT vs MEDIA_TYPE_JSON by branching
on context.query_request.media_type (or media_type variable) and emitting the
appropriate formatter for compaction progress and error events (e.g., call a
text-mode equivalent of stream_compaction_event or adapt stream_compaction_event
to accept media_type), and ensure the HTTPException path similarly uses the text
formatter when media_type == MEDIA_TYPE_TEXT so text clients get plain-text
frames instead of JSON SSE frames.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4ffd1f74-a87a-4832-8a3b-7402dd8b8b5c

📥 Commits

Reviewing files that changed from the base of the PR and between a82f9ff and 5a708c4.

📒 Files selected for processing (7)
  • docs/design/conversation-compaction/conversation-compaction.md
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/models/common/turn_summary.py
  • src/utils/conversation_compaction.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-pr
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/models/common/turn_summary.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/utils/conversation_compaction.py
src/models/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models must use @model_validator and @field_validator for validation and complete type annotations for all attributes, avoiding Any type

Files:

  • src/models/common/turn_summary.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
🧠 Learnings (3)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/common/turn_summary.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/common/turn_summary.py
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
🪛 GitHub Actions: Python linter / 0_Pylinter.txt
tests/unit/app/endpoints/test_streaming_query.py

[error] 2978-2978: pylint (C0301): Line too long (104/100) (line-too-long)

🪛 GitHub Actions: Python linter / Pylinter
tests/unit/app/endpoints/test_streaming_query.py

[error] 2978-2978: pylint reported C0301: Line too long (104/100) (line-too-long).

🔇 Additional comments (5)
src/models/common/turn_summary.py (1)

8-8: LGTM!

Also applies to: 112-116

docs/design/conversation-compaction/conversation-compaction.md (1)

289-290: LGTM!

tests/unit/app/endpoints/test_responses.py (1)

25-25: LGTM!

Also applies to: 2840-2869

src/app/endpoints/responses.py (1)

65-65: LGTM!

Also applies to: 233-245, 493-512, 526-526

src/app/endpoints/streaming_query.py (1)

481-524: LGTM!

Also applies to: 566-603, 770-815, 1061-1063

Comment thread tests/unit/app/endpoints/test_streaming_query.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants