Merged
16 commits
2aa7896
REFACTOR: unify error/blocked scoring across scorers
romanlutz May 18, 2026
44baab3
REFACTOR: convert scorer base-class docstrings from RST to MyST markdown
romanlutz May 19, 2026
cdc282b
STYLE: hoist inline imports in score tests
romanlutz May 19, 2026
08e57de
DOC: re-execute TAP notebook after error-scoring refactor
romanlutz May 19, 2026
da575ab
TEST: drop error_score_map regression-guard tests in TAP
romanlutz May 21, 2026
ced931a
DOC: clarify FloatScaleScorer blocked-fallback docstring
romanlutz May 21, 2026
aad26dc
REFACTOR: extract blocked-fallback helper in FloatScaleScorer and Tru…
romanlutz May 21, 2026
a79e21b
REFACTOR: hoist blocked-fallback hook into base Scorer
romanlutz May 21, 2026
40bd2c7
DOC: spell out blocked-fallback conditions in Score rationale
romanlutz May 21, 2026
08e4827
REFACTOR: inline Score construction in TrueFalseScorer aggregator return
romanlutz May 21, 2026
48f2415
REFACTOR: make Scorer._build_fallback_score abstract and document sco…
romanlutz May 22, 2026
5d0f5e7
FIX: preserve full-conversation scoring on blocked turns and fix inve…
romanlutz May 22, 2026
ac26486
REVERT: restore TrueFalseInverterScorer passthrough-and-invert behavior
romanlutz May 22, 2026
88d869c
Merge branch 'main' into romanlutz/unify-error-scoring
romanlutz May 22, 2026
c3e0a9a
DOC: re-execute TAP notebook against live targets post-merge
romanlutz May 22, 2026
137ef1a
REFACTOR: emit per-category fallback scores in AzureContentFilterScorer
romanlutz May 22, 2026
1,274 changes: 170 additions & 1,104 deletions doc/code/executor/attack/tap_attack.ipynb

Large diffs are not rendered by default.

7 changes: 4 additions & 3 deletions doc/code/executor/attack/tap_attack.py
@@ -83,9 +83,10 @@
# 1. **System Prompt**: Use `TAPSystemPromptPaths.IMAGE_GENERATION` to provide
# an adversarial system prompt tailored for image generation models.
# 2. **Error Handling**: Image generation targets frequently return "blocked"
-# responses due to content filters. TAP's `error_score_map` (default:
-# `{"blocked": 0.0}`) automatically assigns a score of 0.0 to these responses
-# instead of failing the branch, preventing premature pruning of all branches.
+# responses due to content filters. PyRIT's default scorers (TrueFalseScorer
+# and FloatScaleScorer) automatically return `False` / `0.0` for blocked
+# responses, so blocked branches receive a score of `0.0` instead of failing
+# the branch — preventing premature pruning of all branches.
# 3. **Scoring**: The default TAP scorer automatically detects the target's output
# modalities. For image targets, it configures the scorer to accept `image_path`
# responses. The adversarial chat target (used for scoring) should be a multimodal
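The effect described in item 2 above can be illustrated with a minimal, self-contained sketch. It does not use PyRIT: `Branch`, `score_branch`, and `prune` are hypothetical names, the `"blocked"` / `"none"` labels are assumed, and the pruning rule (keep the top `width` branches by score) is only an assumption about how a TAP-style search might rank nodes.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """Hypothetical stand-in for one TAP branch and its latest target response."""

    name: str
    response_error: str  # assumed labels: "none" or "blocked"
    raw_score: float  # what a scorer would return for an unblocked response


def score_branch(branch: Branch) -> float:
    """Return 0.0 for a blocked response instead of raising, so the branch stays rankable."""
    if branch.response_error == "blocked":
        return 0.0
    return branch.raw_score


def prune(branches: list[Branch], width: int) -> list[Branch]:
    """Keep the `width` highest-scoring branches; blocked ones simply sort last."""
    ranked = sorted(branches, key=score_branch, reverse=True)
    return ranked[:width]


branches = [
    Branch("a", "blocked", 0.0),
    Branch("b", "none", 0.4),
    Branch("c", "blocked", 0.0),
]
survivors = prune(branches, width=2)
```

Because a blocked response yields a score rather than an exception, a round in which every branch is blocked still leaves `width` candidates to expand, which is the "no premature pruning" property the comment refers to.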
42 changes: 42 additions & 0 deletions doc/code/scoring/0_scoring.md
@@ -10,6 +10,48 @@ This collection of notebooks shows how to use scorers directly. To see how to us

There are two general types of scorers. `true_false` and `float_scale` (these can often be converted to one type or another). A `true_false` scorer scores something as true or false, and can be used in attacks for things like success criteria. `float_scale` scorers normalize a score between 0 and 1 to try and quantify a level of something (e.g. harmful content).

The scorer hierarchy is rooted at the abstract `Scorer` class. All concrete scorers derive from one of three intermediate base classes: `FloatScaleScorer`, `TrueFalseScorer`, or `ConversationScorer`.

```mermaid
classDiagram
class Scorer { <<abstract>> }
class FloatScaleScorer { <<abstract>> }
class TrueFalseScorer { <<abstract>> }
class ConversationScorer { <<abstract>> }

Scorer <|-- FloatScaleScorer
Scorer <|-- TrueFalseScorer
Scorer <|-- ConversationScorer

FloatScaleScorer <|-- AzureContentFilterScorer
FloatScaleScorer <|-- AudioFloatScaleScorer
FloatScaleScorer <|-- InsecureCodeScorer
FloatScaleScorer <|-- PlagiarismScorer
FloatScaleScorer <|-- SelfAskGeneralFloatScaleScorer
FloatScaleScorer <|-- SelfAskLikertScorer
FloatScaleScorer <|-- SelfAskScaleScorer
FloatScaleScorer <|-- VideoFloatScaleScorer

TrueFalseScorer <|-- AudioTrueFalseScorer
TrueFalseScorer <|-- DecodingScorer
TrueFalseScorer <|-- FloatScaleThresholdScorer
TrueFalseScorer <|-- GandalfScorer
TrueFalseScorer <|-- MarkdownInjectionScorer
TrueFalseScorer <|-- PromptShieldScorer
TrueFalseScorer <|-- QuestionAnswerScorer
TrueFalseScorer <|-- SelfAskCategoryScorer
TrueFalseScorer <|-- SelfAskGeneralTrueFalseScorer
TrueFalseScorer <|-- SelfAskRefusalScorer
TrueFalseScorer <|-- SelfAskTrueFalseScorer
TrueFalseScorer <|-- SubStringScorer
TrueFalseScorer <|-- TrueFalseCompositeScorer
TrueFalseScorer <|-- TrueFalseInverterScorer
TrueFalseScorer <|-- VideoTrueFalseScorer
SelfAskTrueFalseScorer <|-- SelfAskQuestionAnswerScorer
```

`ConversationScorer` is special: it is never instantiated on its own. Instead, `create_conversation_scorer()` dynamically builds a subclass that mixes `ConversationScorer` with either `FloatScaleScorer` or `TrueFalseScorer`, so the resulting scorer inherits its `_build_fallback_score` behavior from whichever scoring base it was paired with. `FloatScaleThresholdScorer` wraps a `FloatScaleScorer` to produce a `true_false` result.
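
As a rough illustration of the dynamic-subclass idea in the paragraph above, the following sketch uses toy classes that only share names with the real ones. It is not PyRIT's implementation: `make_conversation_scorer` is a hypothetical factory, and the zero-argument `_build_fallback_score` is a simplification of the real hook.

```python
from abc import ABC, abstractmethod


class Scorer(ABC):
    """Toy root class; only the fallback hook is modeled here."""

    @abstractmethod
    def _build_fallback_score(self) -> object: ...


class FloatScaleScorer(Scorer):
    def _build_fallback_score(self) -> float:
        return 0.0


class TrueFalseScorer(Scorer):
    def _build_fallback_score(self) -> bool:
        return False


class ConversationScorer(Scorer):
    """Toy mixin: defines no fallback of its own, so it stays abstract."""

    def describe(self) -> str:
        return f"{type(self).__name__} falls back to {self._build_fallback_score()!r}"


def make_conversation_scorer(base: type[Scorer]) -> Scorer:
    """Hypothetical factory: build a subclass on the fly and instantiate it."""
    # ConversationScorer comes first in the bases, but the fallback is resolved
    # through the MRO to `base`, the first class that actually implements it.
    dynamic_cls = type(f"Conversation{base.__name__}", (ConversationScorer, base), {})
    return dynamic_cls()


float_conv = make_conversation_scorer(FloatScaleScorer)
bool_conv = make_conversation_scorer(TrueFalseScorer)
```

In this toy version, instantiating `ConversationScorer` directly raises `TypeError` because the abstract hook is never filled in, which mirrors the "never instantiated on its own" rule.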

[Scores](../../../pyrit/models/score.py) are stored in memory as score objects.

## Setup
108 changes: 19 additions & 89 deletions pyrit/executor/attack/multi_turn/tree_of_attacks.py
@@ -8,7 +8,7 @@
import uuid
from dataclasses import dataclass, field
from pathlib import Path
-from typing import Any, Optional, cast, get_args, overload
+from typing import TYPE_CHECKING, Any, Optional, cast, overload

from treelib.tree import Tree

@@ -50,7 +50,6 @@
Score,
SeedPrompt,
)
-from pyrit.models.literals import PromptDataType, PromptResponseError
from pyrit.prompt_normalizer import PromptConverterConfiguration, PromptNormalizer
from pyrit.prompt_target import CapabilityName, PromptTarget
from pyrit.prompt_target.common.target_requirements import TargetRequirements
@@ -66,36 +65,10 @@
from pyrit.score.scorer_prompt_validator import ScorerPromptValidator
from pyrit.score.true_false.true_false_inverter_scorer import TrueFalseInverterScorer

-logger = logging.getLogger(__name__)
-
-_DEFAULT_ERROR_SCORE_MAP: dict[str, float] = {"blocked": 0.0}
-
-
-def _validate_error_score_map(error_score_map: dict[str, float] | None) -> dict[str, float]:
-    """
-    Validate and return a copy of the error score map.
+if TYPE_CHECKING:
+    from pyrit.models.literals import PromptDataType

-    Args:
-        error_score_map (dict[str, float] | None): The error score map to validate.
-            None uses the default mapping. An empty dict disables error mapping.
-
-    Returns:
-        dict[str, float]: A validated copy of the error score map.
-
-    Raises:
-        ValueError: If a key is not a valid PromptResponseError or a value is outside [0, 1].
-    """
-    if error_score_map is None:
-        return dict(_DEFAULT_ERROR_SCORE_MAP)
-    valid_errors = get_args(PromptResponseError)
-    for key, value in error_score_map.items():
-        if key not in valid_errors:
-            raise ValueError(
-                f"error_score_map key '{key}' is not a valid PromptResponseError. Valid values: {valid_errors}"
-            )
-        if not (0.0 <= value <= 1.0):
-            raise ValueError(f"error_score_map value for '{key}' must be between 0.0 and 1.0, got {value}")
-    return dict(error_score_map)
+logger = logging.getLogger(__name__)


class TAPSystemPromptPaths(enum.Enum):
@@ -322,7 +295,6 @@ def __init__(
parent_id: Optional[str] = None,
prompt_normalizer: Optional[PromptNormalizer] = None,
initial_prompt: Optional[Message] = None,
-        error_score_map: dict[str, float] | None = None,
) -> None:
"""
Initialize a tree node.
@@ -346,16 +318,6 @@ def __init__(
prompt_normalizer (Optional[PromptNormalizer]): Normalizer for handling prompts and responses.
initial_prompt (Optional[Message]): Initial message to send for the first turn,
bypassing adversarial chat generation. Supports multimodal messages.
-            error_score_map (dict[str, float] | None): Mapping of response error types to fixed
-                scores. When a target response has an error matching a key in this map, the
-                corresponding score is assigned instead of invoking the scorer. This prevents
-                premature branch pruning when targets return blocked/filtered responses.
-                Defaults to {"blocked": 0.0}. Pass an empty dict to disable.
-
-                Note: This check runs before the scorer, so if ``score_blocked_content``
-                is set on the objective scorer, it will have no effect for error types
-                present in this map. To evaluate partial content from blocked responses,
-                pass ``error_score_map={}`` to disable the early-return.
"""
# Store configuration
self._objective_target = objective_target
@@ -372,7 +334,6 @@ def __init__(
self._attack_id = attack_id
self._attack_strategy_name = attack_strategy_name
self._memory_labels = memory_labels or {}
-        self._error_score_map = _validate_error_score_map(error_score_map)

# Initialize utilities
self._memory = CentralMemory.get_memory_instance()
@@ -688,9 +649,12 @@ async def _score_response_async(self, *, response: Message, objective: str) -> N
and any auxiliary scorers (which provide additional metrics). The scoring results are
used by the TAP algorithm to decide which branches to explore further.

-        The method leverages the Scorer utility to handle all scoring logic, including error
-        handling and parallel execution of multiple scorers. Responses with errors are skipped
-        to avoid scoring failures from blocking the attack progress.
+        Blocked or errored responses are scored via the scorer's unified default behavior:
+        :class:`~pyrit.score.true_false.true_false_scorer.TrueFalseScorer` returns
+        ``Score(False)`` and :class:`~pyrit.score.float_scale.float_scale_scorer.FloatScaleScorer`
+        returns ``Score(0.0)`` whenever no supported pieces remain after validator filtering
+        (the normal outcome for a blocked piece). This keeps blocked branches at the bottom
+        of the priority queue without needing attack-level error mapping.

Args:
response (Message): The response from the objective target to evaluate.
@@ -707,34 +671,6 @@ async def _score_response_async(self, *, response: Message, objective: str) -> N
Higher scores indicate more successful attacks and influence which branches
the TAP algorithm explores in subsequent iterations.
"""
-        # Check if the response has a mapped error before attempting normal scoring.
-        # This prevents scorer failures when the target returns a blocked/filtered response
-        # (e.g., content policy violations from image generation targets).
-        if self._error_score_map and response.is_error():
-            for response_piece in response.message_pieces:
-                error_type = response_piece.response_error
-                if error_type in self._error_score_map:
-                    assigned_score = self._error_score_map[error_type]
-                    logger.debug(
-                        f"Node {self.node_id}: Response has mapped error '{error_type}', "
-                        f"assigning score {assigned_score}"
-                    )
-                    self.objective_score = Score(
-                        score_value=str(assigned_score),
-                        score_value_description=(f"Assigned score {assigned_score} for '{error_type}' response error"),
-                        score_type="float_scale",
-                        score_category=["error_handling"],
-                        score_rationale=(
-                            f"Response had '{error_type}' error. Assigned fixed score {assigned_score} "
-                            f"via error_score_map to prevent premature branch pruning."
-                        ),
-                        message_piece_id=str(response_piece.id),
-                        scorer_class_identifier=self._objective_scorer.get_identifier(),
-                        objective=objective,
-                    )
-                    self._memory.add_scores_to_memory(scores=[self.objective_score])
-                    return
-
# Use the Scorer utility method to handle all scoring
with execution_context(
component_role=ComponentRole.OBJECTIVE_SCORER,
@@ -874,7 +810,6 @@ def duplicate(self) -> "_TreeOfAttacksNode":
desired_response_prefix=self._desired_response_prefix,
parent_id=self.node_id,
prompt_normalizer=self._prompt_normalizer,
-            error_score_map=self._error_score_map,
)

# Duplicate the conversations to preserve history
@@ -1370,7 +1305,6 @@ def __init__(
desired_response_prefix: str = "Sure, here is",
batch_size: int = 10,
prepended_conversation_config: PrependedConversationConfig | None = None,
-        error_score_map: dict[str, float] | None = None,
) -> None:
"""
Initialize the Tree of Attacks with Pruning attack strategy.
@@ -1395,22 +1329,20 @@ def __init__(
prepended_conversation_config (PrependedConversationConfig | None):
Configuration for how to process prepended conversations. Controls converter
application by role, message normalization, and non-chat target behavior.
-            error_score_map (dict[str, float] | None): Mapping of response error types to fixed
-                scores. When a target response has an error matching a key in this map, the
-                corresponding score is assigned instead of invoking the scorer. This prevents
-                premature branch pruning when targets return blocked/filtered responses (e.g.,
-                content policy violations from image generation targets). Defaults to
-                {"blocked": 0.0}. Pass an empty dict to disable.
-
-                Note: This check runs before the scorer, so if ``score_blocked_content``
-                is set on the objective scorer, it will have no effect for error types
-                present in this map. To evaluate partial content from blocked responses,
-                pass ``error_score_map={}`` to disable the early-return.

Raises:
ValueError: If attack_scoring_config uses a non-FloatScaleThresholdScorer objective scorer,
if the adversarial target does not natively support the capabilities TAP needs,
or if parameters are invalid.

+        Note:
+            Blocked or errored target responses (e.g. content filter triggers from image
+            generation targets) are scored ``0.0`` via the unified
+            :class:`~pyrit.score.float_scale.float_scale_scorer.FloatScaleScorer` default,
+            which prevents premature pruning without any attack-level error mapping. To
+            score partial content from blocked responses, set
+            ``score_blocked_content=True`` on the objective scorer (requires
+            ``prompt_metadata["partial_content"]`` on the blocked piece).
"""
# Validate tree parameters
if tree_depth < 1:
@@ -1437,7 +1369,6 @@ def __init__(
self._on_topic_checking_enabled = on_topic_checking_enabled
self._desired_response_prefix = desired_response_prefix
self._batch_size = batch_size
-        self._error_score_map = _validate_error_score_map(error_score_map)

# Initialize adversarial configuration
self._adversarial_chat = attack_adversarial_config.target
@@ -2032,7 +1963,6 @@ def _create_attack_node(
parent_id=parent_id,
prompt_normalizer=self._prompt_normalizer,
initial_prompt=initial_prompt,
-            error_score_map=self._error_score_map,
)

# Add the adversarial chat conversation ID to the context's tracking (ensuring uniqueness)
16 changes: 12 additions & 4 deletions pyrit/score/conversation_scorer.py
@@ -30,13 +30,21 @@ class ConversationScorer(Scorer, ABC):

_DEFAULT_VALIDATOR: ScorerPromptValidator = ScorerPromptValidator(
supported_data_types=["text"],
-        enforce_all_pieces_valid=True,
+        enforce_all_pieces_valid=False,
)

async def _score_async(self, message: Message, *, objective: Optional[str] = None) -> list[Score]:
"""
Scores the entire conversation history by concatenating all messages and passing to the wrapped scorer.

+        The synthetic conversation Message is always built as ``text`` regardless of the
+        triggering piece's data type or error state. Errors from individual turns are
+        preserved within the rendered text (either as the rendered error JSON or, with
+        ``score_blocked_content`` enabled, as the partial content). This ensures the wrapped
+        scorer's text-only validator accepts the synthetic message and scores the full
+        conversation, even when the triggering turn was blocked or errored; the wrapped
+        scorer's fallback only fires when the rendered conversation is genuinely unscoreable.
+
Args:
message (Message): A message from the conversation to be scored.
The conversation ID from the first message piece is used to retrieve the full conversation from memory.
@@ -97,9 +105,9 @@ async def _score_async(self, message: Message, *, objective: Optional[str] = Non
labels=original_piece.labels, # deprecated
prompt_target_identifier=original_piece.prompt_target_identifier,
attack_identifier=original_piece.attack_identifier,
-original_value_data_type=original_piece.original_value_data_type,
-converted_value_data_type=original_piece.converted_value_data_type,
-response_error=original_piece.response_error,
+original_value_data_type="text",
+converted_value_data_type="text",
+response_error="none",
originator=original_piece.originator,
original_prompt_id=(
cast("UUID", original_piece.original_prompt_id)
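The docstring added in this file describes rendering every turn as text so that an errored turn no longer hides the rest of the conversation. A minimal sketch of that idea follows; it is not the PyRIT implementation. `Turn`, `render_turn`, and `render_conversation` are hypothetical names, and the JSON shape used for an errored turn is an assumption made only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """Hypothetical stand-in for one conversation turn."""

    role: str
    text: str
    response_error: str = "none"  # assumed labels: "none", "blocked", ...
    partial_content: Optional[str] = None


def render_turn(turn: Turn, *, score_blocked_content: bool) -> str:
    """Render one turn as plain text, keeping error information inside the text."""
    if turn.response_error == "none":
        body = turn.text
    elif score_blocked_content and turn.partial_content:
        # Use whatever the target produced before it was cut off.
        body = turn.partial_content
    else:
        # Assumed error rendering; the real format may differ.
        body = json.dumps({"status": "error", "error": turn.response_error})
    return f"{turn.role}: {body}"


def render_conversation(turns: list[Turn], *, score_blocked_content: bool = False) -> str:
    """Join all rendered turns into one text blob a text-only scorer can accept."""
    return "\n".join(
        render_turn(turn, score_blocked_content=score_blocked_content) for turn in turns
    )
```

Since the result is always a plain string, a text-only validator has nothing to reject, and the earlier turns remain visible to the scorer even when the last one was blocked.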
61 changes: 61 additions & 0 deletions pyrit/score/float_scale/azure_content_filter_scorer.py
@@ -22,6 +22,7 @@
from pyrit.identifiers import ComponentIdentifier
from pyrit.models import (
DataTypeSerializer,
Message,
MessagePiece,
Score,
data_serializer_factory,
@@ -342,6 +343,66 @@ async def _score_piece_async(self, message_piece: MessagePiece, *, objective: Op
for result in aggregated_results
]

def _build_fallback_score(self, *, message: Message, objective: Optional[str]) -> list[Score]:
"""
Build one neutral ``0.0`` fallback score per configured harm category.

AzureContentFilterScorer's normal output is one score per category in
``self._harm_categories``. To preserve that shape on blocked / error / filtered
input, this override emits one neutral ``0.0`` score per configured category
(each tagged with the category name and matching the normal-path metadata),
instead of the single category-less score produced by the base
``FloatScaleScorer._build_fallback_score``.

Inspects the first message piece to tailor the rationale/description for
blocked, error, and filtered cases.

Args:
message (Message): The message whose first piece is inspected for status.
objective (Optional[str]): The objective associated with this scoring call.

Returns:
list[Score]: One ``0.0`` ``float_scale`` score per configured harm category,
each attributed to the first piece.

Raises:
ValueError: If the first message piece has no ``id`` or ``original_prompt_id``.
"""
first_piece = message.message_pieces[0]
piece_id = first_piece.id or first_piece.original_prompt_id
if piece_id is None:
raise ValueError("Cannot create score: message piece has no id or original_prompt_id")

if first_piece.is_blocked():
status = (
"The request was blocked by the target (score_blocked_content is False or no partial content available)"
)
description = "Blocked response; returning 0.0 per configured category."
elif first_piece.has_error():
status = f"Response had an error: {first_piece.response_error}"
description = "Error response; returning 0.0 per configured category."
else:
status = "No supported pieces to score after filtering"
description = "No pieces to score after filtering; returning 0.0 per configured category."

rationale = f"{status}; returning 0.0 for each configured harm category."
metadata: dict[str, str | int | float] = {"azure_severity": 0}

return [
Score(
score_value="0.0",
score_value_description=description,
score_type="float_scale",
score_category=[category.value],
score_metadata=metadata,
score_rationale=rationale,
scorer_class_identifier=self.get_identifier(),
message_piece_id=piece_id,
objective=objective,
)
for category in self._harm_categories
]

async def _get_base64_image_data(self, message_piece: MessagePiece) -> str:
"""
Get base64-encoded image data from a message piece.
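One way to see why the override above emits a score per category is to look at a consumer that indexes results by category. The sketch below is a toy model, not PyRIT code: scores are plain dictionaries, `fallback_scores` and `by_category` are hypothetical helpers, and the category names in the usage note are assumed examples.

```python
def fallback_scores(categories: list[str]) -> list[dict]:
    """Emit one neutral 0.0 entry per configured category, mirroring the normal output shape."""
    return [{"category": name, "value": 0.0} for name in categories]


def by_category(scores: list[dict]) -> dict[str, float]:
    """Typical downstream step: look scores up by category name."""
    return {score["category"]: score["value"] for score in scores}
```

With `fallback_scores(["Hate", "Violence"])`, `by_category` returns an entry for each configured category, so downstream code that expects every category key keeps working on blocked input. A single category-less fallback score would leave those lookups empty.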