FEAT: Add MOSSBench multimodal over-sensitivity dataset loader by romanlutz · Pull Request #1786 · microsoft/PyRIT

romanlutz · 2026-05-23T01:49:38Z

Summary

Adds a PyRIT seed dataset loader for MOSSBench — a multimodal over-sensitivity benchmark (not a jailbreak benchmark). It probes whether a VLM incorrectly refuses harmless multimodal queries when the image contains a superficially risky-looking element.

The dataset has 300 manually curated, benign image+question pairs organized around three cognitive-science triggers from the paper:

| Upstream `metadata.over` | Loader enum (`MossBenchOversensitivityType`) | Description |
| --- | --- | --- |
| `"type 1"` | `EXAGGERATED_RISK` | Innocuous scene with a visually salient but contextually harmless risky-looking element (e.g. a toy knife on a playroom shelf). |
| `"type 2"` | `NEGATED_HARM` | Image depicts harm in a way explicitly negated by surrounding context. |
| `"type 3"` | `COUNTERINTUITIVE_INTERPRETATION` | Image whose obvious literal reading conflicts with the actual user intent. |

A well-behaved VLM should answer normally. The paper reports refusal rates up to 76% on SOTA MLLMs.
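
The mapping in the table can be sketched as a small enum plus a lookup. This is an illustrative sketch, not the loader's code: `OversensitivityType`, `UPSTREAM_TO_TYPE`, and `parse_oversensitivity_type` are hypothetical names, and the snake_case enum values are an assumption based on the slug described under "Public API".

```python
from enum import Enum


class OversensitivityType(Enum):
    """Hypothetical stand-in for the loader's MossBenchOversensitivityType."""

    # The snake_case values are assumed, not taken from the loader.
    EXAGGERATED_RISK = "exaggerated_risk"
    NEGATED_HARM = "negated_harm"
    COUNTERINTUITIVE_INTERPRETATION = "counterintuitive_interpretation"


# Upstream `metadata.over` strings mapped to enum members, per the table above.
UPSTREAM_TO_TYPE = {
    "type 1": OversensitivityType.EXAGGERATED_RISK,
    "type 2": OversensitivityType.NEGATED_HARM,
    "type 3": OversensitivityType.COUNTERINTUITIVE_INTERPRETATION,
}


def parse_oversensitivity_type(raw: str) -> OversensitivityType:
    """Translate an upstream label, failing loudly on anything unexpected."""
    try:
        return UPSTREAM_TO_TYPE[raw.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown upstream oversensitivity label: {raw!r}") from None
```

Raising on an unknown label, rather than silently skipping the row, keeps an upstream schema change from quietly shrinking the dataset.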

Design decisions

Image source: GitHub raw at a pinned commit, not HuggingFace.
All 300 PNGs are present in the upstream xirui-li/MOSSBench repo at commit 8d68b061… (verified by counting data/images/*.png). The HF mirror exists but adds gating friction with no upside since the GitHub copy is complete. Images are fetched once via the existing fetch_and_cache_image_async helper.
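
As a rough sketch of what pinning to a commit means for image URLs: the helper below builds a raw GitHub URL from a commit SHA and a pid. `build_image_url` is a hypothetical name, the `{pid}.png` file naming is an assumption (the description only says images are per-pid PNGs under data/images/), and the commit is left as a parameter because the full SHA is not given here.

```python
def build_image_url(*, commit: str, pid: int) -> str:
    """Return a raw.githubusercontent.com URL for one image at a fixed commit.

    Assumes images are stored as data/images/<pid>.png in xirui-li/MOSSBench.
    """
    if not commit:
        raise ValueError("commit must be a SHA so the fetched data cannot drift")
    return (
        "https://raw.githubusercontent.com/xirui-li/MOSSBench/"
        f"{commit}/data/images/{pid}.png"
    )
```

Because the commit is part of the URL, later pushes to the upstream default branch cannot change what the loader downloads.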

Over-refusal objective convention: no SeedObjective.
Mirrors the text-only over-refusal loaders (_XSTestDataset, _ORBenchBaseDataset). The "non-refusal expected" semantics live in the dataset's identity (tag over_refusal), not in a per-row objective field. Each example produces an image SeedPrompt + a text SeedPrompt sharing a prompt_group_id and sequence=0, so the orchestrator delivers them as a single multimodal user turn.
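
The pairing convention can be illustrated without PyRIT itself. In this sketch, `Piece` is a hypothetical stand-in for SeedPrompt that models only the grouping fields, and `build_pair` is an assumed helper name. The real SeedPrompt constructor has more fields and may differ.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for a SeedPrompt; only grouping fields are modeled."""

    value: str
    data_type: str  # "image_path" or "text"
    prompt_group_id: uuid.UUID
    sequence: int


def build_pair(*, image_path: str, question: str) -> list[Piece]:
    """Create one image piece and one text piece that form a single user turn."""
    group_id = uuid.uuid4()  # one fresh id per example, shared by both pieces
    return [
        Piece(image_path, "image_path", group_id, 0),
        Piece(question, "text", group_id, 0),
    ]
```

Sharing both the group id and the sequence number is what marks the two pieces as parts of the same turn rather than consecutive turns. No objective piece is created, matching the convention described above.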

No per-loader max_examples parameter.
The scenario layer already enforces a group-aware row limit via DatasetConfiguration.max_dataset_size (counts SeedGroups, samples without replacement, resume-safe via seeded RNG). Duplicating that knob on the loader with different (deterministic-prefix) semantics is a footgun. A separate session is auditing the other multimodal loaders that still expose max_examples for the same reason.
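
To make "group-aware" concrete, here is a minimal sketch of sampling whole groups, without replacement, from a seeded RNG. It is not the DatasetConfiguration.max_dataset_size implementation; `sample_groups` and its `(group_id, payload)` input shape are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_groups(pieces, *, max_groups, seed):
    """Pick at most `max_groups` whole groups from (group_id, payload) tuples.

    The limit counts groups, not pieces, so an image is never separated
    from its question.
    """
    groups = defaultdict(list)
    for group_id, payload in pieces:
        groups[group_id].append(payload)

    # Sort the ids so the result depends only on the seed, not on input order.
    group_ids = sorted(groups)
    if max_groups >= len(group_ids):
        chosen = group_ids
    else:
        # random.Random(seed).sample draws without replacement.
        chosen = random.Random(seed).sample(group_ids, max_groups)
    return {group_id: groups[group_id] for group_id in chosen}
```

A deterministic-prefix cap (taking the first N rows) would always return the same head of the dataset, which is the semantic mismatch the paragraph above warns about.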

Harm-category mapping: preserved numerically, not labeled.
Upstream metadata.harm is a list of harm-category indices (1–7) describing what risky-looking thing is depicted, not what risk the prompt actually poses. The numeric indices are preserved as metadata["harm_indices"]. No guessed string mapping is published — the upstream code does not document one.

Public API

from pyrit.datasets.seed_datasets.remote import (
    MossBenchOversensitivityType,
    _MossBenchDataset,
)

# All 300 examples (top-level await assumes a notebook or other async context)
dataset = await _MossBenchDataset().fetch_dataset_async()

# Filter by stimulus type
dataset = await _MossBenchDataset(
    oversensitivity_types=[MossBenchOversensitivityType.EXAGGERATED_RISK],
).fetch_dataset_async()

Per-piece metadata preserved on both the image and text SeedPrompt:

oversensitivity_type (snake_case slug)
oversensitivity_type_label (paper's human-readable label)
pid, human, child, syn, ocr, harm_indices
short_description
original_image_url (image piece only)
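
A minimal sketch of how the per-piece metadata listed above might be assembled. `build_metadata` is a hypothetical helper and the flat `record` shape is assumed for illustration. Only the output key names come from this description. The harm indices are copied through as numbers, with no string labels attached.

```python
SHARED_KEYS = ("pid", "human", "child", "syn", "ocr", "short_description")


def build_metadata(record: dict, *, slug: str, label: str, image_url: str):
    """Return (image_metadata, text_metadata) for one example.

    `record` is an assumed flat dict holding the upstream fields plus a
    `harm` list of numeric category indices.
    """
    shared = {key: record[key] for key in SHARED_KEYS}
    shared["oversensitivity_type"] = slug
    shared["oversensitivity_type_label"] = label

    # Each piece gets its own copy of the list to avoid accidental aliasing.
    text_metadata = {**shared, "harm_indices": list(record["harm"])}
    image_metadata = {
        **shared,
        "harm_indices": list(record["harm"]),
        "original_image_url": image_url,  # image piece only
    }
    return image_metadata, text_metadata
```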

Verification

16 unit tests in tests/unit/datasets/test_mossbench_dataset.py mock the network and image fetches and cover type filtering, metadata preservation, required-field validation, the no-SeedObjective convention, and rejection of invalid enum input.
Live smoke test fetched all 300 prompt groups (600 image+text pieces) from the pinned GitHub source. Filtering, metadata, and image caching all work end-to-end.
Pre-commit clean: ruff format, ruff check, ty.
Broader unit suite (504 tests) still passes.

Files

pyrit/datasets/seed_datasets/remote/mossbench_dataset.py — loader + MossBenchOversensitivityType enum
pyrit/datasets/seed_datasets/remote/__init__.py — registered (alphabetical)
tests/unit/datasets/test_mossbench_dataset.py — 16 tests
doc/references.bib — added @li2024mossbench
doc/bibliography.md — citation listed
doc/code/datasets/1_loading_datasets.py + .ipynb — mention added

Citation

@article{li2024mossbench,
  title={MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?},
  author={Li, Xirui and Zhou, Hengguang and Wang, Ruochen and Zhou, Tianyi and Cheng, Minhao and Hsieh, Cho-Jui},
  journal={arXiv preprint arXiv:2406.17806},
  year={2024}
}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

MOSSBench is a 300-example multimodal over-refusal benchmark across three oversensitivity stimulus types (Exaggerated Risk, Negated Harm, Counterintuitive Interpretation). A well-behaved VLM should answer every query normally; refusing indicates over-sensitivity.
- Loader fetches the metadata JSON and per-pid PNG images from GitHub at a pinned commit, builds image+text SeedPrompt pairs sharing a prompt_group_id and sequence=0, and creates no SeedObjective (mirrors the XSTest/OR-Bench over-refusal convention).
- Exposes MossBenchOversensitivityType enum filter and a max_examples cap.
- Preserves the raw image attribute flags (human/child/syn/ocr) and the HarmBench-style numeric harm indices in per-seed metadata.
- Registers MOSSBench in the remote __init__, references.bib, bibliography.md, and the datasets reference notebook.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PyRIT's Sphinx config is MyST, not reST, so :class: / :func: roles render as raw text. Match the convention used by other dataset loaders (plain double-backtick code spans). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz and others added 5 commits May 22, 2026 15:03

Spell out 'over_type' as 'oversensitivity_type' in MOSSBench loader

1bf38e7

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove max_examples from MOSSBench loader (use DatasetConfiguration.max_dataset_size)

364fe3d

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge remote-tracking branch 'origin/main' into romanlutz/mossbench-dataset-loader

4c893bd

# Conflicts: # doc/bibliography.md

FEAT: Add MOSSBench multimodal over-sensitivity dataset loader #1786
romanlutz wants to merge 5 commits into microsoft:main from romanlutz:romanlutz/mossbench-dataset-loader
