REFACTOR: unify error/blocked response scoring across scorers #1770

Merged
romanlutz merged 16 commits into microsoft:main from romanlutz:romanlutz/unify-error-scoring
May 22, 2026

@romanlutz
Contributor

Description

Blocked (content-filtered) responses were being handled in three independent places that all converged on 0.0 / False by accident: the TrueFalseScorer no-pieces fallback, the FloatScaleScoreAggregator empty-list fallback, and TAP's error_score_map. Two scorers also had broken edge cases — SelfAskCategoryScorer would send error content to the LLM (likely raising InvalidJsonException or hallucinating a category), and ConversationScorer would crash with ValueError on any blocked turn.

This PR pushes blocked handling into the scorer base classes so the "correct" default behavior is what every scorer gets for free:

  • TrueFalseScorer defaults to False on blocked input (existing fallback, now load-bearing).
  • FloatScaleScorer defaults to 0.0 on blocked input (existing aggregator fallback, now load-bearing and documented).
  • SelfAskCategoryScorer's validator now rejects error pieces so it gets the same False default instead of asking the LLM about garbage.
  • ConversationScorer no longer rejects error pieces wholesale, so blocked turns mid-conversation are scored normally.
  • TAP's error_score_map is removed: it was added after the last release, was redundant with the new defaults, and only added confusion.

SelfAskRefusalScorer still inverts True/False semantics on blocked content (refusal detected → True). That is intentional, and it is the only "weird" default a user must still wrap in TrueFalseInverterScorer when using it as an objective scorer.
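
A minimal sketch of the pattern described above, using simplified stand-in types. `Piece`, `Score`, `SketchScorer`, and the two subclasses are hypothetical names chosen for illustration; they are not PyRIT's actual classes or signatures. The point is only that the base class drops blocked pieces and returns a typed fallback when nothing scorable remains, so subclasses implement just the normal path.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Piece:
    """Hypothetical stand-in for one response piece."""
    text: str
    blocked: bool = False


@dataclass
class Score:
    """Hypothetical stand-in for a score plus its rationale."""
    value: Union[bool, float]
    rationale: str


class SketchScorer:
    """Base class that owns the blocked/empty fallback."""

    fallback_value: Union[bool, float]

    def score(self, pieces: List[Piece]) -> Score:
        # Drop blocked pieces before any subclass logic runs.
        scorable = [p for p in pieces if not p.blocked]
        if not scorable:
            return Score(
                self.fallback_value,
                f"No scorable pieces; returning {self.fallback_value}.",
            )
        return self._score_pieces(scorable)

    def _score_pieces(self, pieces: List[Piece]) -> Score:
        raise NotImplementedError


class SketchTrueFalseScorer(SketchScorer):
    fallback_value = False  # blocked input counts as "objective not met"

    def _score_pieces(self, pieces: List[Piece]) -> Score:
        hit = any("password" in p.text.lower() for p in pieces)
        return Score(hit, "keyword check")


class SketchFloatScaleScorer(SketchScorer):
    fallback_value = 0.0  # blocked input gets the lowest score

    def _score_pieces(self, pieces: List[Piece]) -> Score:
        longest = max(len(p.text) for p in pieces)
        return Score(min(longest / 100.0, 1.0), "length heuristic")


blocked_only = [Piece(text="", blocked=True)]
tf_score = SketchTrueFalseScorer().score(blocked_only)
fs_score = SketchFloatScaleScorer().score(blocked_only)
```

In this sketch neither subclass mentions blocked content at all; a scorer with different semantics would override `fallback_value` rather than re-implement the filtering.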

Tests and Documentation

  • New tests for the unified defaults in test_float_scale_threshold_scorer.py, test_scorer.py, test_self_ask_category.py, test_conversation_history_scorer.py, and the existing TAP suite.
  • TAP notebook (doc/code/executor/attack/tap_attack.{py,ipynb}) re-executed end-to-end against the real Azure OpenAI GPT-4o strict-filter target and OpenAI image target. The new "The request was blocked by the target; returning 0.0." rationale now appears in the tree output, confirming the refactor fires in a live run.
  • Class docstrings on FloatScaleScorer and TrueFalseScorer updated and converted from RST to MyST markdown so they render correctly in the jupyter-book docs site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz and others added 5 commits May 18, 2026 16:50
Push consistent blocked / error handling into the scorer base classes so that:

  * TrueFalseScorer returns Score(False) when no supported pieces remain

  * FloatScaleScorer returns Score(0.0) when no supported pieces remain

Semantic overrides (e.g., SelfAskRefusalScorer returning True on blocked) stay in the subclass and continue to work as before.

Changes:

* pyrit/score/float_scale/float_scale_scorer.py: add _score_async override mirroring TrueFalseScorer's no-pieces fallback. Returns Score(0.0) with a rationale distinguishing blocked / error / filtered cases.

* pyrit/score/true_false/self_ask_category_scorer.py: restrict default validator to text-only so blocked pieces are filtered out (instead of sent to the LLM as garbage).

* pyrit/score/conversation_scorer.py: relax default validator (enforce_all_pieces_valid=False) so a blocked input message no longer raises ValueError before the conversation can even be looked up.

* pyrit/executor/attack/multi_turn/tree_of_attacks.py: hard-remove error_score_map plumbing (added since the last release, so no public callers exist). TAP now relies on the unified scorer defaults to produce 0.0 on blocked responses.

* Tests + docs updated to match.
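
The two validator changes above pull in opposite directions (one filters, one stops raising), which a small sketch may make clearer. This assumes a simplified model where each piece carries a `data_type` label; `Piece`, `select_supported`, and the `"text"` / `"error"` labels are hypothetical for this example. Only the flag name `enforce_all_pieces_valid` comes from the commit message, and its real signature may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Piece:
    """Hypothetical stand-in for one message piece."""
    data_type: str  # assumed labels: "text" for normal output, "error" for blocked
    value: str


def select_supported(
    pieces: Sequence[Piece],
    *,
    supported_types: Tuple[str, ...] = ("text",),
    enforce_all_pieces_valid: bool = False,
) -> List[Piece]:
    """Return the pieces a scorer should look at.

    Lenient mode silently drops unsupported pieces; strict mode raises
    if any piece is unsupported.
    """
    supported = [p for p in pieces if p.data_type in supported_types]
    if enforce_all_pieces_valid and len(supported) != len(pieces):
        raise ValueError("Message contains unsupported pieces.")
    return supported


message = [Piece("text", "hello"), Piece("error", "content filter triggered")]

# Lenient: the error piece is filtered out and never reaches the scoring logic.
lenient = select_supported(message)

# Strict: the same message is rejected outright.
try:
    select_supported(message, enforce_all_pieces_valid=True)
    strict_raised = False
except ValueError:
    strict_raised = True
```

When every piece is filtered out, the lenient call returns an empty list, which is the "no supported pieces remain" case that the base-class fallback handles.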

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PyRIT renders docstrings as MyST markdown via jupyter-book. The class docstrings for TrueFalseScorer and FloatScaleScorer used RST-only constructs that wouldn't render correctly:

* Section headers ('Default error / blocked behavior' + '----' underline) → bold heading

* :class:'~pyrit.X.Y.Z' Sphinx cross-references → plain code-span class names

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
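
For illustration, a docstring written in the Markdown-friendly style that commit describes might look like the sketch below. `DocstringSketch` and the sentence under the heading are made up for this example; only the heading text is taken from the commit message, and the real docstrings may be worded differently.

```python
class DocstringSketch:
    """Hypothetical scorer used only to illustrate docstring markup.

    **Default error / blocked behavior**

    When no supported pieces remain, a `Score` of `0.0` is returned.
    """

    # The RST version would have used an underlined section header and a
    # Sphinx cross-reference role; the docstring above uses a bold heading
    # and plain code spans instead.


doc = DocstringSketch.__doc__
```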
Move from-imports out of test bodies to module top, per review feedback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Refresh tap_attack.ipynb outputs against the unified blocked-scoring path.
The new "The request was blocked by the target; returning 0.0." rationale
now appears in live TAP runs against the strict-filter target, replacing
the old error_score_map message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The kwarg was introduced after the last release, so no client code can
still be passing it. The two guard tests only exercised Python's standard
"unknown kwarg → TypeError" behavior; the parameter's absence is already
guaranteed by the __init__ signatures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
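
The "unknown kwarg → TypeError" behavior mentioned in that commit is plain Python and does not depend on anything in PyRIT. A minimal sketch, where `SketchAttack` is a hypothetical class rather than the real TAP attack:

```python
class SketchAttack:
    """Hypothetical class whose __init__ no longer accepts error_score_map."""

    def __init__(self, *, objective: str) -> None:
        self.objective = objective


try:
    # Passing a keyword that is not in the signature fails at call time.
    SketchAttack(objective="demo", error_score_map={})
    message = ""
except TypeError as exc:
    message = str(exc)
```

Because the interpreter enforces this for any signature, a test that only checks for the `TypeError` exercises Python itself rather than project code.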
@romanlutz romanlutz changed the title from "[BREAKING] REFACTOR: unify error/blocked response scoring across scorers" to "REFACTOR: unify error/blocked response scoring across scorers" May 21, 2026
@hannahwestra25 hannahwestra25 self-assigned this May 21, 2026
Comment thread pyrit/score/float_scale/float_scale_scorer.py Outdated
Comment thread pyrit/score/float_scale/float_scale_scorer.py
Comment thread pyrit/score/float_scale/float_scale_scorer.py Outdated
@rlundeen2 rlundeen2 self-assigned this May 21, 2026
Comment thread pyrit/score/float_scale/float_scale_scorer.py
Comment thread pyrit/score/float_scale/float_scale_scorer.py Outdated
Comment thread pyrit/score/float_scale/float_scale_scorer.py Outdated
Comment thread pyrit/score/float_scale/float_scale_scorer.py
Comment thread pyrit/score/float_scale/float_scale_scorer.py Outdated
@jsong468 jsong468 self-assigned this May 21, 2026
romanlutz and others added 3 commits May 21, 2026 11:39
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eFalseScorer

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

@behnam-o behnam-o left a comment


I like the unification: introducing the idea of a fallback score and handling it in the superclass. I just think the fallback score, by definition, should always have a value; it shouldn't be optional or None. But I might be missing nuances, so not blocking.

Comment thread pyrit/score/scorer.py Outdated
romanlutz and others added 5 commits May 21, 2026 14:37
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rer hierarchy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rter fallback semantics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz
Contributor Author

Self-correction: reverted the TrueFalseInverterScorer change from commit 5d0f5e7. On further review, the original passthrough-and-invert behavior is correct. When the inner scorer's fallback fires False ("attack did not succeed"), the inverter correctly inverts that to True under the inverter's flipped semantic ("attack failed" or "no compliance detected"), which is the right answer for blocked messages. Forcing the inverter's own fallback instead would have flipped this to a misleading False.

The ConversationScorer fix from the same commit is unaffected and still in place.
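
The passthrough-and-invert reasoning above can be sketched with plain functions. `objective_scorer` and `invert` are hypothetical stand-ins for an inner true/false scorer and an inverting wrapper; they are not the PyRIT classes, and `None` is used here only as a simple way to model a blocked response.

```python
from typing import Callable, Optional

BoolScorer = Callable[[Optional[str]], bool]


def objective_scorer(response: Optional[str]) -> bool:
    """Hypothetical inner scorer: True means the attack succeeded."""
    if response is None:
        # Blocked response: the fallback fires with False.
        return False
    return "here is how" in response.lower()


def invert(scorer: BoolScorer) -> BoolScorer:
    """Wrap a scorer so its result is flipped, including the fallback."""

    def inverted(response: Optional[str]) -> bool:
        return not scorer(response)

    return inverted


attack_failed = invert(objective_scorer)

blocked_result = attack_failed(None)
complied_result = attack_failed("Sure, here is how to do it.")
```

Here the wrapper has no fallback of its own: a blocked response comes out of the inner scorer as False and is flipped to True ("attack failed"), which matches the behavior the comment describes as correct.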

romanlutz and others added 2 commits May 22, 2026 11:23
Conflict in doc/code/executor/attack/tap_attack.ipynb resolved by
taking main's version (incorporates the new pyrit.output printer module).
The notebook can be re-executed in a follow-up to refresh cached rationale
strings with the new blocked-fallback messages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread pyrit/score/conversation_scorer.py
Comment thread pyrit/score/float_scale/float_scale_scorer.py Outdated
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz romanlutz added this pull request to the merge queue May 22, 2026
Merged via the queue into microsoft:main with commit 3eb316c May 22, 2026
48 checks passed
@romanlutz romanlutz deleted the romanlutz/unify-error-scoring branch May 22, 2026 22:36

5 participants