LCORE-1572: Integrate conversation compaction into the query flow#1796
LCORE-1572: Integrate conversation compaction into the query flow#1796max-svistunov wants to merge 19 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 30 minutes and 17 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughThis PR implements explicit-input conversation compaction across query, responses, streaming, and A2A endpoints. The system omits the Responses API ChangesConversation Compaction Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Introduce runtime conversation compaction (Option A): once a conversation approaches the model's context window, lightspeed-stack summarizes older turns and owns the LLM context itself instead of letting Llama Stack reload the full history. - src/utils/conversation_compaction.py: apply_compaction() async generator and apply_compaction_blocking() wrapper. Holds a per-conversation lock (R11), estimates tokens (LCORE-1569), partitions and summarizes old turns (LCORE-1570), writes the summary into the conversation as a marker item, and rebuilds the request as explicit input (summaries + recent verbatim turns + new query). Marker items track the boundary; the conversation_id is preserved and the full history stays in Llama Stack items for audit. - models/common/responses/responses_api_params.py: omit_conversation flag so the conversation parameter is dropped from the request body in compacted mode while remaining on the object for identity. - configuration.py: AppConfig.compaction accessor. - app/endpoints/query.py: apply compaction after preparing params; in compacted mode store the completed turn against the original user query (the conversation parameter is no longer sent, so Llama Stack does not persist the turn automatically). Background: the spec's original marker-keeps-conversation-parameter approach was found unimplementable on llama-stack 0.6.0, which always reloads the full conversation history when the conversation parameter is set. This restores the spike's original explicit-input approach.
Cover marker detection and boundary selection, explicit-input assembly, the trigger threshold, the disabled / no-context-window / existing-marker / triggered paths of apply_compaction, the streaming CompactionStartedEvent ordering, and compacted-turn storage.
The A2A executor uses the same prepare_responses_params + Responses API flow as /v1/query and persists conversation_id for multi-turn contexts, so it accumulates context and must compact too. - Run apply_compaction_blocking before responses.create (A2A is not a browser SSE stream, so no progress event is emitted). - In compacted mode, persist the completed turn from the response.completed stream event, since the conversation parameter is no longer sent and Llama Stack therefore does not store the turn automatically.
…oint Stream /v1/streaming_query through the compaction-aware path only when the conversation actually compacts, so non-compacting requests are unaffected (byte-for-byte the existing flow, including HTTP error handling). - conversation_compaction: add needs_compaction_path(), a cheap pre-stream predicate (no LLM, no lock) that is true only when the conversation already has a summary marker or would trigger a new compaction. - streaming_query: when the predicate is true, stream via the new generate_response_with_compaction(), which emits the compaction progress event before the summarization LLM call (R12) and creates the response inside the stream, surfacing create-time errors as SSE error events. generate_response gains emit_start/compacted parameters and, in compacted mode, appends the completed turn to the conversation (the conversation parameter is not sent, so Llama Stack does not store it automatically). - a2a: silence too-many-lines after the earlier compaction wiring.
Cover needs_compaction_path: disabled, existing-marker, over-threshold, and under-threshold — the gate that keeps non-compacting requests on the unchanged streaming path.
/v1/responses is the OpenAI-compatible Responses API, so compaction is silent: no custom SSE event is injected (preserving wire compatibility) and create-time error handling is unchanged. Summarization runs before the response is created, on both the streaming and non-streaming paths. - responses_endpoint_handler: run apply_compaction_blocking before the streaming/non-streaming split, gated to stateful single-conversation requests (store=True, a conversation present, no previous_response_id). - ResponsesContext: carry compacted_original_input so the finalization can store the turn against the original user input. - _append_previous_response_turn: generalized to also append the turn in compacted mode (the conversation parameter is dropped, so Llama Stack does not store the turn automatically) using the original input.
Verify _append_previous_response_turn stores the turn against the original input in compacted mode, and stores nothing when store is disabled.
Revise R10, R12, the architecture flow, the changed-request-flow section, and the implementation guidance to match what was built: in compacted mode lightspeed-stack builds explicit input and omits the Llama Stack conversation parameter (which always reloads full history), preserving conversation_id and the full item history. Record the redesign and the four affected endpoints (query, streaming_query, A2A, /v1/responses) in a new Changelog section.
… warning) The explicit compacted input was assembled as plain dicts, which produced PydanticSerializationUnexpectedValue warnings when ResponsesApiParams was dumped (its input field is typed ResponseInput). Build the summary, recent verbatim, and query items as typed OpenAIResponseMessage objects instead. Verified end-to-end against a live stack: the serializer warning is gone and compaction still triggers, preserves conversation identity, and recalls earlier context correctly.
apply_compaction_blocking asserted that the generator yielded a result. Under python -O asserts are stripped, so the guard would vanish and a None result could propagate to callers. Replace it with an explicit None check that raises RuntimeError. Clears a GitHub code-scanning (CodeQL) "use of assert" finding. The repository's Bandit configuration skips B101, so this only surfaced via code scanning, not the Bandit CI job.
Make the conversation summary cache the preferred source of truth for compaction summaries and the home of the persisted recursive fold. - apply_compaction / apply_compaction_blocking gain cache + user_id + skip_user_id_check. Summaries are read from the cache (get_summaries) and each new chunk is written to it (store_summary); the Llama Stack marker texts remain an authoritative fallback when no persisting cache is configured (marker-only mode, additive summaries, no fold). - When the persisted summaries themselves exceed the threshold, they are folded via recursively_resummarize and the fold is persisted with replace_summaries, so it is computed once and reused rather than recomputed per request (R3). - configured_conversation_cache() resolves the configured cache (or None) for the endpoints. - Wired into /v1/query, /v1/streaming_query, and /v1/responses. The A2A executor stays marker-only: it has no resolved user_id for the (user_id, conversation_id) cache key. Adds 7 unit tests: cache-preferred reads, store-on-compaction, fold trigger and persistence, no-fold-without-cache, marker fallback, and the cache resolver.
3473bb9 to
79dd34d
Compare
|
@tisnik Could you PTAL? Three design decisions worth noting:
|
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/app/endpoints/responses.py (1)
485-518:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCarry
compacted_original_inputfor every explicit-input request, including blocked ones.
apply_compaction_blocking()can rewrite the request when reusing an existing summary marker, not just when it creates a fresh one. Because Lines 502-503 only setcompacted_original_inputoncompaction.summarized,_append_previous_response_turn()misses marker-reuse requests, and shield-blocked requests still persistapi_params.input— which is the synthetic summaries-plus-recent payload after compaction. Populate this whenever the request omitsconversation, and have blocked-turn persistence prefer it too.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/app/endpoints/responses.py` around lines 485 - 518, The code only sets compacted_original_input when compaction.summarized is true, so requests that reused an existing summary marker (or were shield-blocked) keep api_params.input (the synthetic summary+recent payload); update the post-compaction assignment in the ResponsesContext construction to set compacted_original_input whenever apply_compaction_blocking() returns an original_input (i.e., if compaction.original_input is present) regardless of compaction.summarized, and update _append_previous_response_turn() so blocked-turn persistence prefers ResponsesContext.compacted_original_input over api_params.input when the request omits conversation; this ensures marker-reuse and blocked requests carry the true original_input.src/app/endpoints/query.py (1)
205-226:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRefresh Azure credentials before running compaction.
apply_compaction_blocking()can make its own Llama Stack summarization call. For Azure-backed models, the refresh now happens only afterward, so an expired token will fail compaction even though the normal response path would have refreshed it. Move the Azure refresh block above Lines 208-217, or refresh inside the compaction path as well.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/app/endpoints/query.py` around lines 205 - 226, The Azure token refresh happens after apply_compaction_blocking, which can itself call Llama Stack and will fail with expired Azure tokens; move the Azure refresh check (use AzureEntraIDManager().is_entra_id_configured, AzureEntraIDManager().is_token_expired and AzureEntraIDManager().refresh_token()) to run before calling apply_compaction_blocking (or alternatively ensure apply_compaction_blocking refreshes via the same AzureEntraIDManager path), and if refresh succeeds call AsyncLlamaStackClientHolder().update_azure_token() to replace client before invoking apply_compaction_blocking (references: apply_compaction_blocking, AzureEntraIDManager, AsyncLlamaStackClientHolder, responses_params.model).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/app/endpoints/a2a.py`:
- Around line 418-425: The current call to _convert_stream_to_events passes
compaction.original_input only when compaction.summarized is true, so
reused-summary (explicit-input/omit-conversation) flows that have
compaction.original_input but compaction.summarized==False don't get the
compacted turn persisted; change the gating to use the presence of
compaction.original_input (not compaction.summarized). Update the async for call
and any subsequent persistence logic to pass and check compaction.original_input
(e.g., pass compaction.original_input directly to _convert_stream_to_events and
persist the completed A2A turn whenever compaction.original_input is non-null)
so explicit-input reuse flows store the compacted original turn.
In `@src/app/endpoints/query.py`:
- Around line 229-235: The code currently passes original_input to
retrieve_response based on compaction.summarized, which misses cases where a
cached summary was reused but conversation was omitted; change the logic to key
original_input (and thus enable store_compacted_turn()) off the compaction's
"explicit-input / conversation omitted" flag instead of compaction.summarized:
pass original_input=compaction.original_input if compaction indicates
conversation was omitted (e.g. compaction.explicit_input_mode or
compaction.omits_conversation) else None, and ensure downstream logic in
retrieve_response/store_compacted_turn() relies on that explicit-input marker
rather than compaction.summarized.
In `@src/app/endpoints/streaming_query.py`:
- Around line 820-835: The compacted-path currently calls
append_turn_to_conversation with only query_request.query and
turn_summary.llm_response, which flattens outputs and drops attachments/non-text
items and can duplicate shield refusals; replace this by constructing and
persisting the same structured compacted-turn object that the non-compacted path
(the helper used by retrieve_response_generator) produces, and pass that
structured turn into append_turn_to_conversation so it preserves attachments,
non-text outputs and shield-block metadata (use the same helper the
retrieve_response_generator path uses rather than sending plain strings from
context.query_request.query and turn_summary.llm_response).
- Around line 637-699: The compacted flag must be driven by whether the request
actually entered explicit-input mode rather than the CompactionResult.summarized
value; change the logic around compacted (currently initialized before
apply_compaction and overwritten from item.summarized) to derive compacted from
the incoming request parameters (responses_params) that indicate explicit-input
mode (e.g. responses_params.explicit_input or
responses_params.explicit_input_mode) and only fall back to item.summarized when
that request-level flag is absent; update the block that handles
CompactionResult from apply_compaction and the value passed into
generate_response(generator, ..., compacted=...) accordingly so explicit-input
requests always set compacted=True.
In `@src/models/common/responses/contexts.py`:
- Around line 59-66: Replace the legacy typing Optional[...] with modern union
syntax for the field compacted_original_input: change its annotation from
Optional[ResponseInput] to ResponseInput | None in the declaration of
compacted_original_input (the Field(...) definition referencing ResponseInput),
ensuring the type hint uses the pipe form everywhere this field is defined.
In `@src/utils/conversation_compaction.py`:
- Around line 147-151: is_marker_item currently treats any message whose text
starts with MARKER_SENTINEL as a compaction marker, allowing user content to
spoof markers and cause premature context trimming; update is_marker_item (and
the boundary-check logic that relies on it) to only accept markers that are
tamper-resistant by checking a dedicated, non-user-writable attribute or
structured field on the item (e.g., a metadata flag like
item.get("compaction_marker") or a system-only role), and also require the
sentinel to be the entire marker token (not just a prefix) and/or validate a
marker version/signature if available; ensure you keep the is_message_item check
and reference MARKER_SENTINEL, but only return True when the item has the
protected marker attribute and the exact sentinel (or validated signature) is
present so user messages cannot forge markers.
- Around line 397-399: The compaction routine only calls estimate_tokens when
original_input is a str, so list-form ResponseInput items are skipped; update
the logic around original_input (used with estimate_tokens and encoding_name
that updates estimated) to handle list and dict forms: if original_input is a
list, iterate and sum estimate_tokens for each element (recursively handle
elements that are str or dict with text/content fields), and if it's a dict
(ResponseInput), extract the relevant text field(s) and estimate those tokens;
apply the same fix in the second occurrence that updates estimated (the block
mirrored around the other estimate_tokens call).
---
Outside diff comments:
In `@src/app/endpoints/query.py`:
- Around line 205-226: The Azure token refresh happens after
apply_compaction_blocking, which can itself call Llama Stack and will fail with
expired Azure tokens; move the Azure refresh check (use
AzureEntraIDManager().is_entra_id_configured,
AzureEntraIDManager().is_token_expired and
AzureEntraIDManager().refresh_token()) to run before calling
apply_compaction_blocking (or alternatively ensure apply_compaction_blocking
refreshes via the same AzureEntraIDManager path), and if refresh succeeds call
AsyncLlamaStackClientHolder().update_azure_token() to replace client before
invoking apply_compaction_blocking (references: apply_compaction_blocking,
AzureEntraIDManager, AsyncLlamaStackClientHolder, responses_params.model).
In `@src/app/endpoints/responses.py`:
- Around line 485-518: The code only sets compacted_original_input when
compaction.summarized is true, so requests that reused an existing summary
marker (or were shield-blocked) keep api_params.input (the synthetic
summary+recent payload); update the post-compaction assignment in the
ResponsesContext construction to set compacted_original_input whenever
apply_compaction_blocking() returns an original_input (i.e., if
compaction.original_input is present) regardless of compaction.summarized, and
update _append_previous_response_turn() so blocked-turn persistence prefers
ResponsesContext.compacted_original_input over api_params.input when the request
omits conversation; this ensures marker-reuse and blocked requests carry the
true original_input.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a2e60d83-a115-4796-bfe8-53dec5894ee7
📒 Files selected for processing (12)
docs/design/conversation-compaction/conversation-compaction.mdsrc/app/endpoints/a2a.pysrc/app/endpoints/query.pysrc/app/endpoints/responses.pysrc/app/endpoints/streaming_query.pysrc/configuration.pysrc/models/common/responses/contexts.pysrc/models/common/responses/responses_api_params.pysrc/utils/conversation_compaction.pytests/unit/app/endpoints/test_a2a.pytests/unit/app/endpoints/test_responses.pytests/unit/utils/test_conversation_compaction.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 2
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Llama Stack imports: Usefrom llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Useasync deffor I/O operations and external API calls
Use standard log levels with clear purposes:debug()for diagnostic info,info()for program execution,warning()for unexpected events,error()for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes:Configuration,Error/Exception,Resolver,Interface
Abstract classes must use ABC with@abstractmethoddecorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes
Files:
src/models/common/responses/responses_api_params.pysrc/configuration.pysrc/models/common/responses/contexts.pysrc/app/endpoints/query.pysrc/app/endpoints/responses.pysrc/app/endpoints/a2a.pysrc/utils/conversation_compaction.pysrc/app/endpoints/streaming_query.py
src/models/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Pydantic models must use
@model_validatorand@field_validatorfor validation and complete type annotations for all attributes, avoidingAnytype
Files:
src/models/common/responses/responses_api_params.pysrc/models/common/responses/contexts.py
src/**/configuration.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/configuration.py: All config models must extendConfigurationBasewithextra="forbid"to reject unknown fields
Use@field_validatorand@model_validatorfor custom validation in Pydantic models
Files:
src/configuration.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/unit/app/endpoints/test_a2a.pytests/unit/app/endpoints/test_responses.pytests/unit/utils/test_conversation_compaction.py
src/app/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/app/**/*.py: FastAPI dependencies: Import fromfastapimodule forAPIRouter,HTTPException,Request,status,Depends
Use FastAPIHTTPExceptionwith appropriate status codes for API endpoints and handleAPIConnectionErrorfrom Llama Stack
Files:
src/app/endpoints/query.pysrc/app/endpoints/responses.pysrc/app/endpoints/a2a.pysrc/app/endpoints/streaming_query.py
🧠 Learnings (4)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/common/responses/responses_api_params.pysrc/models/common/responses/contexts.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.
Applied to files:
src/models/common/responses/responses_api_params.pysrc/models/common/responses/contexts.py
📚 Learning: 2026-01-14T09:37:51.612Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.
Applied to files:
src/app/endpoints/query.py
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.
Applied to files:
src/app/endpoints/query.pysrc/app/endpoints/responses.pysrc/app/endpoints/a2a.pysrc/app/endpoints/streaming_query.py
🔇 Additional comments (6)
src/models/common/responses/responses_api_params.py (1)
124-146: LGTM!src/configuration.py (1)
21-21: LGTM!Also applies to: 323-336
tests/unit/utils/test_conversation_compaction.py (1)
1-517: LGTM!tests/unit/app/endpoints/test_a2a.py (1)
804-869: LGTM!tests/unit/app/endpoints/test_responses.py (2)
23-23: LGTM!
2789-2836: LGTM!
| compacted_original_input: Optional[ResponseInput] = Field( | ||
| default=None, | ||
| description="Set only when conversation compaction (LCORE-1572) rewrote " | ||
| "the request: the original user input before the explicit-input " | ||
| "rewrite. When present, the completed turn is appended to the " | ||
| "conversation using this input, since the conversation parameter was " | ||
| "dropped and Llama Stack therefore does not store the turn.", | ||
| ) |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Use modern union syntax for type annotations.
The field uses Optional[ResponseInput] instead of the modern ResponseInput | None syntax. As per coding guidelines, type annotations should use modern syntax like str | int rather than Optional or Union.
♻️ Proposed fix to use modern syntax
- compacted_original_input: Optional[ResponseInput] = Field(
+ compacted_original_input: ResponseInput | None = Field(
default=None,
description="Set only when conversation compaction (LCORE-1572) rewrote "
"the request: the original user input before the explicit-input "
"rewrite. When present, the completed turn is appended to the "
"conversation using this input, since the conversation parameter was "
"dropped and Llama Stack therefore does not store the turn.",
)As per coding guidelines, all functions must have complete type annotations using modern syntax (str | int).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/models/common/responses/contexts.py` around lines 59 - 66, Replace the
legacy typing Optional[...] with modern union syntax for the field
compacted_original_input: change its annotation from Optional[ResponseInput] to
ResponseInput | None in the declaration of compacted_original_input (the
Field(...) definition referencing ResponseInput), ensuring the type hint uses
the pipe form everywhere this field is defined.
There was a problem hiding this comment.
Declining this one. This repository's coding guidelines (CLAUDE.md) specify Optional[Type] for optionals, and the codebase uses Optional[...] consistently (including throughout conversation_compaction.py). Changing this single field to ResponseInput | None would be inconsistent with the surrounding style.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Resolve the CodeRabbit comment r3303871929 on PR `#1796`
gh api \
--method PUT \
-H "Accept: application/vnd.github+json" \
/repos/lightspeed-core/lightspeed-stack/pulls/1796/comments/r3303871929 \
2>/dev/null || true
# List my review comments on the PR to find the right one
gh api \
-H "Accept: application/vnd.github+json" \
"/repos/lightspeed-core/lightspeed-stack/pulls/1796/comments" \
--jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, path: .path, line: .line, body: (.body | .[0:80])}'Length of output: 2166
🏁 Script executed:
#!/bin/bash
# Try to resolve the pull request review thread via GraphQL
# First, find the thread ID for comment 3303871929
gh api graphql -f query='
{
repository(owner: "lightspeed-core", name: "lightspeed-stack") {
pullRequest(number: 1796) {
reviewThreads(first: 100) {
nodes {
id
isResolved
comments(first: 1) {
nodes {
databaseId
body
}
}
}
}
}
}
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.comments.nodes[0].databaseId == 3303871929) | {threadId: .id, isResolved: .isResolved}'Length of output: 741
🏁 Script executed:
#!/bin/bash
# Resolve the review thread
gh api graphql -f query='
mutation {
resolveReviewThread(input: {threadId: "PRRT_kwDOOiwXDs6Ey9Uc"}) {
thread {
id
isResolved
}
}
}'Length of output: 705
@max-svistunov, acknowledged and noted — the Optional[Type] style will be respected in future reviews for this repository. I wasn't able to auto-resolve the thread due to integration permissions, so feel free to resolve it manually when convenient.
🧠 Learnings used
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.
| def is_marker_item(item: Any) -> bool: | ||
| """Return True when *item* is a compaction summary marker message.""" | ||
| if not is_message_item(item): | ||
| return False | ||
| return extract_message_text(item).startswith(MARKER_SENTINEL) |
There was a problem hiding this comment.
Make marker identification tamper-resistant.
Line 147 currently treats any message starting with MARKER_SENTINEL as a compaction marker. A user can forge that prefix in normal input, and then Line 159 will shift the boundary after the forged message, dropping prior context from explicit-input assembly.
Also applies to: 159-169
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/conversation_compaction.py` around lines 147 - 151, is_marker_item
currently treats any message whose text starts with MARKER_SENTINEL as a
compaction marker, allowing user content to spoof markers and cause premature
context trimming; update is_marker_item (and the boundary-check logic that
relies on it) to only accept markers that are tamper-resistant by checking a
dedicated, non-user-writable attribute or structured field on the item (e.g., a
metadata flag like item.get("compaction_marker") or a system-only role), and
also require the sentinel to be the entire marker token (not just a prefix)
and/or validate a marker version/signature if available; ensure you keep the
is_message_item check and reference MARKER_SENTINEL, but only return True when
the item has the protected marker attribute and the exact sentinel (or validated
signature) is present so user messages cannot forge markers.
…ity rename - Count tokens for list-form ResponseInput (e.g. /v1/responses), not only the string form, so compaction is not skipped on large item-list inputs that could otherwise still hit HTTP 413. Adds _estimate_response_input_tokens and a regression test. - Rename CompactionResult.summarized to compacted: the flag means "served in compacted / explicit-input mode" (set whenever the conversation has any summary, reused or fresh), not "a summary was created this request". The old name caused reviewer confusion about turn-persistence gating, which is correct as written.
…bit lightspeed-core#4) In compacted mode the streaming endpoint persisted the completed turn as flattened strings via append_turn_to_conversation, dropping attachments and non-text output items, and double-storing for shield-blocked requests. Persist the structured turn instead: - Capture the response's structured output items onto TurnSummary.output_items (set at response.completed, and to the refusal item on a shield block). - generate_response now takes original_input and persists via store_compacted_turn with the original input plus structured output items, matching the /v1/query and A2A paths. - The shield-blocked branch no longer stores the turn when the conversation parameter was omitted (compacted mode); generate_response stores it once with the correct original input, avoiding the duplicate refusal turn. Adds tests for the structured compacted persistence and the shield dedup (compacted and non-compacted).
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/app/endpoints/responses.py`:
- Around line 502-503: Blocked compacted responses are persisting the synthetic
compacted prompt (api_params.input) instead of the original user input; update
_persist_blocked_response_turn (and any call sites that append blocked turns) to
use compaction.compacted ? compaction.original_input (the
compacted_original_input variable set when compaction.compacted is true) as the
previous-response input instead of api_params.input so that
_append_previous_response_turn receives the original user input for blocked
compacted turns; locate references to compacted_original_input,
api_params.input, _persist_blocked_response_turn, and
_append_previous_response_turn and branch to prefer compacted_original_input
when present.
In `@src/app/endpoints/streaming_query.py`:
- Around line 839-847: The interrupted compacted-path currently ignores
original_input and uses responses_params.input (which can be a non-string
rewritten prompt), causing wrong turn storage and a problematic cast; update
_persist_interrupted_turn to thread original_input through to
store_compacted_turn (same way as the completed path does), use original_input
when present instead of responses_params.input, and avoid calling cast(str,
responses_params.input) for compacted responses; refer to functions/variables
_persist_interrupted_turn, store_compacted_turn, original_input,
responses_params.input, and responses_params.conversation when making the
change.
- Around line 1017-1020: generate_response() only populates
turn_summary.output_items when the stream ended with response.completed, which
causes compacted-mode persistence to lose partial artifacts for
response.failed/response.incomplete; update the logic that sets
turn_summary.output_items = list(latest_response_object.output or []) so it runs
for any terminal response (e.g., when response.status in {completed, failed,
incomplete} or a generic response.terminal/terminal flag) rather than only on
response.completed, ensuring failed/incomplete terminals preserve their
output_items for compacted persistence.
In `@src/models/common/turn_summary.py`:
- Around line 111-115: The output_items field currently typed as list[Any]
disables Pydantic validation; change it to a concrete list of a new Pydantic
model (e.g., OutputItem) that describes the persisted item shape, update the
Field type to list[OutputItem], and add validation using
`@field_validator`("output_items", mode="before"/"after") to coerce/validate each
list element and a `@model_validator` for the TurnSummary model to enforce
invariants before persistence; ensure the new OutputItem model defines all
required fields with types and serialization-friendly types so invalid or
non-serializable payloads are rejected at the field boundary.
In `@src/utils/conversation_compaction.py`:
- Around line 129-146: The short-circuit that sets compacted=False when
compaction is globally disabled should only disable creating new summaries, not
reuse of existing summaries/markers: change the early return on the `enabled`
flag so that it does not force `compacted=False` unconditionally; instead, when
`enabled` is False, still inspect the conversation for existing summary markers
or cached summaries and set `compacted = True` if any are present (so the code
path that omits `conversation` and uses `params` explicit input is preserved),
but skip any logic that would create new summaries; update the same logic
referenced around lines 380-382 accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e66085a9-88c3-43b3-8d6e-656d16eb7c7d
📒 Files selected for processing (8)
src/app/endpoints/a2a.pysrc/app/endpoints/query.pysrc/app/endpoints/responses.pysrc/app/endpoints/streaming_query.pysrc/models/common/turn_summary.pysrc/utils/conversation_compaction.pytests/unit/app/endpoints/test_streaming_query.pytests/unit/utils/test_conversation_compaction.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Llama Stack imports: Usefrom llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Useasync deffor I/O operations and external API calls
Use standard log levels with clear purposes:debug()for diagnostic info,info()for program execution,warning()for unexpected events,error()for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes:Configuration,Error/Exception,Resolver,Interface
Abstract classes must use ABC with@abstractmethoddecorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes
Files:
src/app/endpoints/a2a.pysrc/app/endpoints/query.pysrc/models/common/turn_summary.pysrc/utils/conversation_compaction.pysrc/app/endpoints/responses.pysrc/app/endpoints/streaming_query.py
src/app/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/app/**/*.py: FastAPI dependencies: Import fromfastapimodule forAPIRouter,HTTPException,Request,status,Depends
Use FastAPIHTTPExceptionwith appropriate status codes for API endpoints and handleAPIConnectionErrorfrom Llama Stack
Files:
src/app/endpoints/a2a.pysrc/app/endpoints/query.pysrc/app/endpoints/responses.pysrc/app/endpoints/streaming_query.py
src/models/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Pydantic models must use
@model_validatorand@field_validatorfor validation and complete type annotations for all attributes, avoidingAnytype
Files:
src/models/common/turn_summary.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/unit/app/endpoints/test_streaming_query.pytests/unit/utils/test_conversation_compaction.py
🧠 Learnings (4)
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.
Applied to files:
src/app/endpoints/a2a.pysrc/app/endpoints/query.pysrc/app/endpoints/responses.pysrc/app/endpoints/streaming_query.py
📚 Learning: 2026-01-14T09:37:51.612Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.
Applied to files:
src/app/endpoints/query.py
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/common/turn_summary.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.
Applied to files:
src/models/common/turn_summary.py
🔇 Additional comments (6)
tests/unit/app/endpoints/test_streaming_query.py (1)
874-875: LGTM!Also applies to: 891-906, 907-950, 1215-1279
tests/unit/utils/test_conversation_compaction.py (1)
6-6: LGTM!Also applies to: 133-133, 152-152, 183-183, 220-220, 257-257, 476-476, 520-533
src/utils/conversation_compaction.py (1)
313-322: LGTM!Also applies to: 405-412, 562-565
src/app/endpoints/query.py (1)
229-235: LGTM!src/app/endpoints/streaming_query.py (1)
79-88: LGTM!Also applies to: 374-384, 645-708, 720-743
src/app/endpoints/a2a.py (1)
418-425: LGTM!
| ``compacted`` is True, ``params.input`` is an explicit item list | ||
| (summaries + recent turns + new query) and the ``conversation`` | ||
| parameter is omitted from the request body. | ||
| compacted: Whether the request is served in compacted (explicit-input) | ||
| mode — i.e. the conversation has at least one summary (from a marker | ||
| or the cache), so the ``conversation`` parameter is omitted. This is | ||
| True whether the summary was created this request or reused from a | ||
| prior one. Drives ``context_status``. | ||
| original_input: The new user query exactly as it arrived (before the | ||
| explicit-input rewrite). Populated only in compacted mode (where | ||
| ``compacted`` is True); ``None`` otherwise. In compacted mode the | ||
| caller must append this plus the LLM output to the conversation | ||
| items itself, since the ``conversation`` parameter is no longer | ||
| passed to Llama Stack. | ||
| """ | ||
|
|
||
| params: ResponsesApiParams | ||
| compacted: bool |
There was a problem hiding this comment.
Preserve explicit-input mode for conversations that were already compacted.
compacted is now documented as "this conversation already has summaries, so omit conversation", but the new enabled short-circuit returns compacted=False before you inspect markers or cached summaries. If ops disables compaction after a conversation was compacted once, the next request falls back to full-history replay, which can reintroduce the 413 path and send marker text back through the model context. Gate only new summarization on enabled; reuse of existing summaries still needs the compacted path.
Also applies to: 380-382
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/conversation_compaction.py` around lines 129 - 146, The
short-circuit that sets compacted=False when compaction is globally disabled
should only disable creating new summaries, not reuse of existing
summaries/markers: change the early return on the `enabled` flag so that it does
not force `compacted=False` unconditionally; instead, when `enabled` is False,
still inspect the conversation for existing summary markers or cached summaries
and set `compacted = True` if any are present (so the code path that omits
`conversation` and uses `params` explicit input is preserved), but skip any
logic that would create new summaries; update the same logic referenced around
lines 380-382 accordingly.
There was a problem hiding this comment.
Declining by design. enabled: false is a full off-switch and the disabled path is a zero-overhead no-op; honoring existing summaries when disabled would require fetching conversation items on every disabled request, breaking the documented "when off, behaves exactly as before" contract. Disabling mid-conversation on an already-compacted conversation is an unsupported transition — now documented in a code comment (apply_compaction) and in the spec doc's Configuration section + Changelog (5a708c4, 597ab01).
There was a problem hiding this comment.
@max-svistunov, understood — the zero-overhead contract for the disabled path is a valid design constraint, and documenting the unsupported mid-conversation disable transition in both the code and the spec is the right mitigation. I'll withdraw this suggestion.
(゚ ε ゚ )
✏️ Learnings added
Learnt from: max-svistunov
Repo: lightspeed-core/lightspeed-stack PR: 1796
File: src/utils/conversation_compaction.py:129-146
Timestamp: 2026-05-27T08:23:03.962Z
Learning: In `src/utils/conversation_compaction.py` (lightspeed-stack), `compaction.enabled: false` is intentionally a full off-switch and a zero-overhead no-op. It does **not** attempt to honor existing summary markers or cached summaries when disabled, because doing so would require fetching conversation items on every disabled request, breaking the "when off, behaves exactly as before" contract. Disabling compaction mid-conversation on an already-compacted conversation is an explicitly unsupported transition; the limitation is documented in the `apply_compaction` code comment and in the spec doc's Configuration section + Changelog.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
…is disabled configured_conversation_cache() is evaluated eagerly as a call argument in the query endpoint, so it ran on every request and accessed configuration.conversation_cache unconditionally — forcing the (SQLite) cache to initialize even when compaction is disabled. On configurations whose cache file could not be opened that raised and returned HTTP 500, which failed the e2e suites (where compaction is off). Return None without touching the cache when compaction is disabled; the cache is only used by compaction on this path. Adds a regression test.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/utils/test_conversation_compaction.py`:
- Around line 522-535: The test currently only asserts the return value and
doesn't fail if configured_conversation_cache() eagerly reads
configuration.conversation_cache; update the mock so any access to
configuration.conversation_cache raises—e.g., attach a PropertyMock
(unittest.mock.PropertyMock) or set a side_effect on
mock_config.conversation_cache that raises RuntimeError, then call
cc.configured_conversation_cache() and assert it returns None; this ensures
configured_conversation_cache() does not touch configuration.conversation_cache
(referencing configured_conversation_cache, mock_config, and
configuration.conversation_cache to locate the code to change).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3b4a3722-e903-4104-b0c5-dfac4b309d76
📒 Files selected for processing (2)
src/utils/conversation_compaction.pytests/unit/utils/test_conversation_compaction.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E: library mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/unit/utils/test_conversation_compaction.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Llama Stack imports: Usefrom llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Useasync deffor I/O operations and external API calls
Use standard log levels with clear purposes:debug()for diagnostic info,info()for program execution,warning()for unexpected events,error()for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes:Configuration,Error/Exception,Resolver,Interface
Abstract classes must use ABC with@abstractmethoddecorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes
Files:
src/utils/conversation_compaction.py
🔇 Additional comments (2)
src/utils/conversation_compaction.py (1)
302-315: LGTM!tests/unit/utils/test_conversation_compaction.py (1)
504-519: LGTM!Also applies to: 538-551
| def test_configured_conversation_cache_none_when_compaction_disabled( | ||
| mocker: MockerFixture, | ||
| ) -> None: | ||
| """Returns None when compaction is disabled, even if a cache is configured. | ||
|
|
||
| Regression guard (LCORE-1572): the cache must not be initialized on every | ||
| request when compaction is off — doing so 500'd e2e requests on configs whose | ||
| SQLite cache could not be opened. | ||
| """ | ||
| mock_config = mocker.patch.object(cc, "configuration") | ||
| mock_config.compaction.enabled = False | ||
| mock_config.conversation_cache_configuration.type = "sqlite" | ||
| mock_config.conversation_cache = object() | ||
| assert cc.configured_conversation_cache() is None |
There was a problem hiding this comment.
Make the regression test fail on eager cache access.
This only checks the return value. Since configuration is a mock, configured_conversation_cache() can still read configuration.conversation_cache and the test will keep passing, so it does not actually guard the reported regression.
🔍 Tighten the regression guard
+from types import SimpleNamespace
from typing import Any, cast
@@
def test_configured_conversation_cache_none_when_compaction_disabled(
mocker: MockerFixture,
) -> None:
@@
- mock_config = mocker.patch.object(cc, "configuration")
- mock_config.compaction.enabled = False
- mock_config.conversation_cache_configuration.type = "sqlite"
- mock_config.conversation_cache = object()
+ class _Config:
+ compaction = SimpleNamespace(enabled=False)
+ conversation_cache_configuration = SimpleNamespace(type="sqlite")
+
+ `@property`
+ def conversation_cache(self) -> object:
+ raise AssertionError("conversation_cache should not be accessed")
+
+ mocker.patch.object(cc, "configuration", _Config())
assert cc.configured_conversation_cache() is None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/utils/test_conversation_compaction.py` around lines 522 - 535, The
test currently only asserts the return value and doesn't fail if
configured_conversation_cache() eagerly reads configuration.conversation_cache;
update the mock so any access to configuration.conversation_cache raises—e.g.,
attach a PropertyMock (unittest.mock.PropertyMock) or set a side_effect on
mock_config.conversation_cache that raises RuntimeError, then call
cc.configured_conversation_cache() and assert it returns None; this ensures
configured_conversation_cache() does not touch configuration.conversation_cache
(referencing configured_conversation_cache, mock_config, and
configuration.conversation_cache to locate the code to change).
…ges) Follow-ups to the streaming-persistence work, all for non-happy-path terminals in compacted mode (conversation parameter omitted), so the persisted turn uses the original user input + structured output rather than the explicit rewrite or flattened strings: - /v1/responses: shield-blocked turns persist against compacted_original_input, not api_params.input (the explicit rewrite). - streaming: interrupted (CancelledError) turns thread original_input through the interrupt callback and persist structured items, fixing the wrong-input storage and the cast(str, input) break on list inputs. - streaming: capture output_items on response.failed / response.incomplete terminals too, not only response.completed, so compacted persistence keeps partial output. - TurnSummary.output_items typed as list[OpenAIResponseOutput] instead of list[Any]. Also documents that disabling compaction mid-conversation on an already-compacted conversation reverts it to full-history replay (unsupported transition); the enabled flag stays a full off-switch (CodeRabbit E, declined by design). Adds unit tests for the blocked /responses path, the interrupted compacted path, and output_items capture on a failed terminal.
…pec doc (CodeRabbit E)
…f-truth, persisted fold) The spec still described the earlier design (cache as a parallel/best-effort layer, markers as the summary source). Update Summary storage, Additive summarization, and Changed request flow to the as-built design, and add a Changelog entry: the cache is the preferred source of truth for summaries (marker texts as fallback + audit/boundary), the recursive fold is persisted via replace_summaries (in-memory fold rejected), A2A is marker-only, and the enabled flag stays a full off-switch.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/app/endpoints/streaming_query.py (1)
660-680:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHonor
MEDIA_TYPE_TEXTin the compaction wrapper.
generate_response_with_compaction()now bypasses the endpoint's text-mode formatters and unconditionally emits SSE JSON for the compaction-progress event and theHTTPExceptionpath. Atext/plainclient will therefore receive rawdata: {...}frames whenever compaction runs or create-time validation fails.🩹 Minimal fix sketch
if isinstance(item, CompactionStartedEvent): - yield stream_compaction_event(context.conversation_id) + if media_type != MEDIA_TYPE_TEXT: + yield stream_compaction_event(context.conversation_id) ... - except HTTPException as e: - yield _http_exception_stream_event(e) + except HTTPException as e: + yield _http_exception_stream_event(e, media_type)-def _http_exception_stream_event(exc: HTTPException) -> str: +def _http_exception_stream_event( + exc: HTTPException, media_type: str = MEDIA_TYPE_JSON +) -> str: detail = ( exc.detail if isinstance(exc.detail, dict) else {"response": str(exc.detail)} ) + if media_type == MEDIA_TYPE_TEXT: + return ( + f"Status: {exc.status_code} - " + f"{detail.get('response', '')} - {detail.get('cause')}" + ) return format_stream_data( {"event": "error", "data": {"status_code": exc.status_code, **detail}} )Also applies to: 691-693
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/app/endpoints/streaming_query.py` around lines 660 - 680, The compaction wrapper in generate_response_with_compaction currently emits SSE JSON (via stream_compaction_event and the error/HTTPException paths) regardless of the request media_type, causing text/plain clients to receive raw JSON SSE frames; update generate_response_with_compaction (the loop using apply_compaction and the exception handling) to honor MEDIA_TYPE_TEXT vs MEDIA_TYPE_JSON by branching on context.query_request.media_type (or media_type variable) and emitting the appropriate formatter for compaction progress and error events (e.g., call a text-mode equivalent of stream_compaction_event or adapt stream_compaction_event to accept media_type), and ensure the HTTPException path similarly uses the text formatter when media_type == MEDIA_TYPE_TEXT so text clients get plain-text frames instead of JSON SSE frames.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Line 2978: The long module/test docstring "Interrupted compacted turn persists
the original input, not the explicit rewrite (LCORE-1572)." exceeds pylint's
100-character limit; replace it with a wrapped multi-line docstring under 100
chars per line by splitting the sentence across lines (e.g., two lines) so each
line is <=100 characters and preserve the same text and ticket reference.
---
Outside diff comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 660-680: The compaction wrapper in
generate_response_with_compaction currently emits SSE JSON (via
stream_compaction_event and the error/HTTPException paths) regardless of the
request media_type, causing text/plain clients to receive raw JSON SSE frames;
update generate_response_with_compaction (the loop using apply_compaction and
the exception handling) to honor MEDIA_TYPE_TEXT vs MEDIA_TYPE_JSON by branching
on context.query_request.media_type (or media_type variable) and emitting the
appropriate formatter for compaction progress and error events (e.g., call a
text-mode equivalent of stream_compaction_event or adapt stream_compaction_event
to accept media_type), and ensure the HTTPException path similarly uses the text
formatter when media_type == MEDIA_TYPE_TEXT so text clients get plain-text
frames instead of JSON SSE frames.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4ffd1f74-a87a-4832-8a3b-7402dd8b8b5c
📒 Files selected for processing (7)
docs/design/conversation-compaction/conversation-compaction.mdsrc/app/endpoints/responses.pysrc/app/endpoints/streaming_query.pysrc/models/common/turn_summary.pysrc/utils/conversation_compaction.pytests/unit/app/endpoints/test_responses.pytests/unit/app/endpoints/test_streaming_query.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-pr
- GitHub Check: E2E: library mode / ci / group 2
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (4)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Llama Stack imports: Usefrom llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Useasync deffor I/O operations and external API calls
Use standard log levels with clear purposes:debug()for diagnostic info,info()for program execution,warning()for unexpected events,error()for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes:Configuration,Error/Exception,Resolver,Interface
Abstract classes must use ABC with@abstractmethoddecorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes
Files:
src/models/common/turn_summary.pysrc/app/endpoints/responses.pysrc/app/endpoints/streaming_query.pysrc/utils/conversation_compaction.py
src/models/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Pydantic models must use
@model_validatorand@field_validatorfor validation and complete type annotations for all attributes, avoidingAnytype
Files:
src/models/common/turn_summary.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/unit/app/endpoints/test_responses.pytests/unit/app/endpoints/test_streaming_query.py
src/app/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/app/**/*.py: FastAPI dependencies: Import fromfastapimodule forAPIRouter,HTTPException,Request,status,Depends
Use FastAPIHTTPExceptionwith appropriate status codes for API endpoints and handleAPIConnectionErrorfrom Llama Stack
Files:
src/app/endpoints/responses.pysrc/app/endpoints/streaming_query.py
🧠 Learnings (3)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/common/turn_summary.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.
Applied to files:
src/models/common/turn_summary.py
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.
Applied to files:
src/app/endpoints/responses.pysrc/app/endpoints/streaming_query.py
🪛 GitHub Actions: Python linter / 0_Pylinter.txt
tests/unit/app/endpoints/test_streaming_query.py
[error] 2978-2978: pylint (C0301): Line too long (104/100) (line-too-long)
🪛 GitHub Actions: Python linter / Pylinter
tests/unit/app/endpoints/test_streaming_query.py
[error] 2978-2978: pylint reported C0301: Line too long (104/100) (line-too-long).
🔇 Additional comments (5)
src/models/common/turn_summary.py (1)
8-8: LGTM!Also applies to: 112-116
docs/design/conversation-compaction/conversation-compaction.md (1)
289-290: LGTM!tests/unit/app/endpoints/test_responses.py (1)
25-25: LGTM!Also applies to: 2840-2869
src/app/endpoints/responses.py (1)
65-65: LGTM!Also applies to: 233-245, 493-512, 526-526
src/app/endpoints/streaming_query.py (1)
481-524: LGTM!Also applies to: 566-603, 770-815, 1061-1063
Description
Wires conversation compaction into the request flow so long conversations are summarized and continue working instead of failing with HTTP 413 (Conversation Compaction epic, LCORE-1631). Builds on the token estimator (LCORE-1569), the summarization module (LCORE-1570), and the summary cache (LCORE-1571).
The conversation-handling approach changed during implementation, away from what the in-repo design docs originally specified:
R10indocs/design/conversation-compaction/conversation-compaction.md, and spikeDecision 6, "option B" indocs/design/conversation-compaction/conversation-compaction-spike.md): inject the summary as a marked item into the existing Llama Stack conversation, keep passing theconversationparameter, and "select only items from the last summary marker onward."conversationparameter once a conversation is compacted.conversationparameter is set, Llama Stack always reloads the full stored message history and exposes no marker-based selection hook — so a marker item cannot limit what the model actually sees. This was verified empirically (the experiment is recorded in the spike doc). Option A is, in fact, the spike's original recommendation, which a later spike edit had replaced with the marker approach; this reinstates it.conversation-compaction.md—R10,R12, the architecture flow, and a new Changelog entry. The investigation, experiment, and decision history live inconversation-compaction-spike.md.With option A,
conversation_idis preserved throughout and the full history (including pre-compaction turns) remains in the conversation's items for UI/audit.Endpoints. Compaction is applied to every endpoint that builds context from a growing conversation and calls the Responses API:
/v1/query— non-streaming./v1/streaming_query— native SSE; emits acompactionevent before the summarization call (R12). Only conversations that actually compact take the in-stream path, so non-compacting requests are byte-for-byte unchanged (including HTTP error handling)./v1/responses— OpenAI-compatible; compacts silently (no non-standard SSE event) to preserve wire compatibility.The
/v1/rlsapiinference path is stateless (no stored conversation) and is intentionally out of scope.Key changes.
src/utils/conversation_compaction.py:apply_compaction()/apply_compaction_blocking(), theneeds_compaction_path()routing predicate, marker helpers, and a per-conversation lock (R11).ResponsesApiParams.omit_conversationflag — drops theconversationparameter from the request body in compacted mode while keeping it on the object for identity.AppConfig.compactionaccessor; endpoint wiring inquery.py,streaming_query.py,a2a.py,responses.py;ResponsesContext.compacted_original_inputfor turn storage.apply_compactionreads summaries viaget_summaries, writes each new chunk viastore_summary, and — when the summaries' own token count crosses the threshold — folds them viarecursively_resummarizeand persists the fold viareplace_summaries. Falls back to the Llama Stack marker texts when no persisting cache is configured.configured_conversation_cache()resolves the cache for the endpoints; A2A runs marker-only (nouser_idfor the cache key).Disabled by default (
compaction.enabled: false); when off, all endpoints behave exactly as before.Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Unit tests:
uv run python -m pytest tests/unit/utils/test_conversation_compaction.py tests/unit/app/endpoints/test_query.py tests/unit/app/endpoints/test_a2a.py tests/unit/app/endpoints/test_streaming_query.py tests/unit/app/endpoints/test_responses.py -q-> all pass. Full suiteuv run python -m pytest tests/unit-> 2399 passed, 1 skipped (no regressions; the compaction-disabled default leaves all endpoints unchanged).End-to-end against a live stack (Lightspeed Stack -> Llama Stack 0.6.0 -> OpenAI gpt-4o-mini), with
compaction.enabled: trueand a 2000-token window so compaction triggers after a few turns:/v1/query: input-token sawtooth across turns (~251 -> ~985 -> drops to ~734 when compaction fires), the secret planted in turn 0 was still recalled after it had been summarized out of the LLM context,conversation_ididentical across all turns, and zero 413 errors. Log confirmsutils.compaction: Summarizing N conversation items./v1/streaming_query: SSE event order wasstart -> compaction -> token -> turn_complete -> end(the compaction progress event is flushed before summarization),conversation_idunchanged.Persisted recursive fold (R3), end-to-end with a SQLite conversation cache and a small (1200-token) window so summaries cross the fold threshold: across many turns the accumulated summaries were repeatedly folded (
Folding N summaries (... tokens) for conversation ...in the log, followed byRecursively re-summarizing N existing summary chunks),replace_summariescollapsed them to a single row in theconversation_summariestable, and a secret planted in turn 0 was still recalled after surviving both summarization and folding.conversation_idstable throughout.Summary by CodeRabbit
New Features
Documentation
Tests