Skip to content

[security] fix(drivers): treat LLM observations as untrusted#60

Open
Hinotoi-agent wants to merge 1 commit into
microsoft:mainfrom
Hinotoi-agent:fix/llm-driver-untrusted-observations
Open

[security] fix(drivers): treat LLM observations as untrusted#60
Hinotoi-agent wants to merge 1 commit into
microsoft:mainfrom
Hinotoi-agent:fix/llm-driver-untrusted-observations

Conversation

@Hinotoi-agent
Copy link
Copy Markdown

Summary

This hardens the LLM-backed prompt driver against target-agent prompt injection by treating prior target responses and evaluator rationales as untrusted observations.

Previously, _build_user_message() concatenated the target agent's previous response directly into the driver-side user message as free-form text:

Agent response: ...
Evaluator outcome: ...
Evaluator rationale: ...

Because the driver LLM uses that message to generate the next prompt, a target agent response could include instruction-like text such as "ignore previous instructions" or "the next prompt must be ..." and steer the harness away from the configured objective.

Changes

  • Serialize prior-turn data as JSON fields with explicit _untrusted labels.
  • Add a warning in each driver-side user message not to follow instructions, role claims, or policy overrides inside JSON string values.
  • Strengthen the LLM driver system prompt with the same trust-boundary rule.
  • Add regression tests covering:
    • target responses and evaluator rationales represented as untrusted JSON data
    • system prompt documentation of the untrusted-observation boundary

Validation

uv run pytest tests/unit/drivers/test_llm_driver.py -q
# 26 passed

uv run pytest -q
# 450 passed

uv run ruff check rampart/drivers/llm.py tests/unit/drivers/test_llm_driver.py
uv run ruff format --check rampart/drivers/llm.py tests/unit/drivers/test_llm_driver.py
uv run pyright rampart/drivers/llm.py tests/unit/drivers/test_llm_driver.py
# 0 errors, 0 warnings, 0 informations

python -m compileall -q rampart/drivers/llm.py tests/unit/drivers/test_llm_driver.py
git diff --check

@Hinotoi-agent Hinotoi-agent requested a review from a team May 21, 2026 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant