FEAT: Add MOSSBench multimodal over-sensitivity dataset loader #1786

Open
romanlutz wants to merge 5 commits into microsoft:main from romanlutz:romanlutz/mossbench-dataset-loader


Summary

Adds a PyRIT seed dataset loader for MOSSBench — a multimodal over-sensitivity benchmark (not a jailbreak benchmark). It probes whether a VLM incorrectly refuses harmless multimodal queries when the image contains a superficially risky-looking element.

The dataset has 300 manually curated, benign image+question pairs organized around three cognitive-science triggers from the paper:

| Upstream metadata.over | Loader enum (MossBenchOversensitivityType) | Description |
| --- | --- | --- |
| "type 1" | EXAGGERATED_RISK | Innocuous scene with a visually salient but contextually harmless risky-looking element (e.g. a toy knife on a playroom shelf). |
| "type 2" | NEGATED_HARM | Image depicts harm in a way explicitly negated by surrounding context. |
| "type 3" | COUNTERINTUITIVE_INTERPRETATION | Image whose obvious literal reading conflicts with the actual user intent. |

A well-behaved VLM should answer normally. The paper reports refusal rates up to 76% on SOTA MLLMs.
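The mapping in the table above could be sketched roughly as follows. This is an illustration only, not the loader's code: `OversensitivityType` is a stand-in for `MossBenchOversensitivityType`, the enum values are guessed snake_case slugs, and `parse_oversensitivity_type` is a hypothetical helper name.

```python
from enum import Enum


class OversensitivityType(Enum):
    """Illustrative stand-in for MossBenchOversensitivityType (values are assumed slugs)."""

    EXAGGERATED_RISK = "exaggerated_risk"
    NEGATED_HARM = "negated_harm"
    COUNTERINTUITIVE_INTERPRETATION = "counterintuitive_interpretation"


# Upstream `metadata.over` strings mapped to enum members, as listed in the table.
UPSTREAM_TO_TYPE = {
    "type 1": OversensitivityType.EXAGGERATED_RISK,
    "type 2": OversensitivityType.NEGATED_HARM,
    "type 3": OversensitivityType.COUNTERINTUITIVE_INTERPRETATION,
}


def parse_oversensitivity_type(raw: str) -> OversensitivityType:
    """Translate an upstream type string into the enum, rejecting unknown values."""
    key = raw.strip().lower()
    try:
        return UPSTREAM_TO_TYPE[key]
    except KeyError:
        raise ValueError(f"Unknown oversensitivity type: {raw!r}") from None
```

Failing loudly on an unknown string, rather than defaulting to one of the three types, keeps a silent upstream schema change from mislabeling rows.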

Design decisions

Image source: GitHub raw at a pinned commit, not HuggingFace.
All 300 PNGs are present in the upstream xirui-li/MOSSBench repo at commit 8d68b061… (verified by counting data/images/*.png). The HF mirror exists but adds gating friction with no upside since the GitHub copy is complete. Images are fetched once via the existing fetch_and_cache_image_async helper.
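As a rough sketch of what pinning to a commit means for the image URLs: GitHub serves raw files at `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>`, so using a commit SHA as the ref freezes the content. The helper name `build_image_url` and the `<pid>.png` file naming are assumptions for illustration; the full commit SHA is not shown here and has to be supplied by the caller.

```python
RAW_BASE = "https://raw.githubusercontent.com"


def build_image_url(commit: str, pid: str, repo: str = "xirui-li/MOSSBench") -> str:
    """Build a raw GitHub URL for one image at a pinned commit.

    Assumption: images are stored as data/images/<pid>.png in the upstream repo.
    """
    if not commit:
        raise ValueError("A commit SHA is required so the image source stays pinned")
    return f"{RAW_BASE}/{repo}/{commit}/data/images/{pid}.png"
```

Requiring the commit argument, instead of defaulting to a branch name, avoids accidentally tracking a moving ref.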

Over-refusal objective convention: no SeedObjective.
Mirrors the text-only over-refusal loaders (_XSTestDataset, _ORBenchBaseDataset). The "non-refusal expected" semantics live in the dataset's identity (tag over_refusal), not in a per-row objective field. Each example produces an image SeedPrompt + a text SeedPrompt sharing a prompt_group_id and sequence=0, so the orchestrator delivers them as a single multimodal user turn.
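The pairing convention can be illustrated with a small stand-in. `Piece` below is a hypothetical dataclass, not PyRIT's `SeedPrompt`, and the `data_type` strings are assumed values; the point is only that both pieces carry the same `prompt_group_id` and `sequence=0`, and that no objective object is created.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Piece:
    """Hypothetical stand-in for a SeedPrompt; field names mirror the description above."""

    value: str
    data_type: str
    prompt_group_id: uuid.UUID
    sequence: int = 0
    metadata: dict = field(default_factory=dict)


def make_pair(image_path: str, question: str) -> list[Piece]:
    """Return an image piece and a text piece that form one multimodal user turn."""
    group_id = uuid.uuid4()  # one ID shared by both pieces of the example
    return [
        Piece(value=image_path, data_type="image_path", prompt_group_id=group_id),
        Piece(value=question, data_type="text", prompt_group_id=group_id),
    ]
```

Because the two pieces share a group ID and a sequence number, a consumer grouping by those two fields sees one turn with two parts rather than two separate turns.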

No per-loader max_examples parameter.
The scenario layer already enforces a group-aware row limit via DatasetConfiguration.max_dataset_size (counts SeedGroups, samples without replacement, resume-safe via seeded RNG). Duplicating that knob on the loader with different (deterministic-prefix) semantics is a footgun. A separate session is auditing the other multimodal loaders that still expose max_examples for the same reason.
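To make the difference between a group-aware limit and a deterministic prefix concrete, here is a minimal sketch. It is not the `DatasetConfiguration.max_dataset_size` implementation; `sample_groups` and the dict-based pieces are hypothetical, and the sketch only assumes that each piece exposes a `prompt_group_id`.

```python
import random
from collections import defaultdict


def sample_groups(pieces: list[dict], max_groups: int, seed: int) -> list[dict]:
    """Keep at most `max_groups` whole groups, chosen without replacement."""
    groups = defaultdict(list)
    for piece in pieces:
        groups[piece["prompt_group_id"]].append(piece)

    # Sort the IDs so the seed alone determines which groups are picked.
    group_ids = sorted(groups)
    if max_groups >= len(group_ids):
        chosen = group_ids
    else:
        chosen = random.Random(seed).sample(group_ids, max_groups)

    # Emit every piece of each chosen group so image and text never get split.
    return [piece for gid in chosen for piece in groups[gid]]
```

A prefix cap such as `pieces[:n]` counts pieces, not groups, and always returns the same first rows; counting groups and sampling with a seeded RNG gives a reproducible but unbiased subset.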

Harm-category mapping: preserved numerically, not labeled.
Upstream metadata.harm is a list of harm-category indices (1–7) describing what risky-looking thing is depicted, not what risk the prompt actually poses. The numeric indices are preserved as metadata["harm_indices"]. No guessed string mapping is published — the upstream code does not document one.

Public API

from pyrit.datasets.seed_datasets.remote import (
    MossBenchOversensitivityType,
    _MossBenchDataset,
)

# All 300 examples
dataset = await _MossBenchDataset().fetch_dataset_async()

# Filter by stimulus type
dataset = await _MossBenchDataset(
    oversensitivity_types=[MossBenchOversensitivityType.EXAGGERATED_RISK],
).fetch_dataset_async()

Per-piece metadata preserved on both the image and text SeedPrompt:

  • oversensitivity_type (snake_case slug)
  • oversensitivity_type_label (paper's human-readable label)
  • pid, human, child, syn, ocr, harm_indices
  • short_description
  • original_image_url (image piece only)
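One way to picture the "shared metadata plus one image-only key" rule is the sketch below. `piece_metadata` is a hypothetical helper, not part of the loader, and the example values are made up for illustration.

```python
from typing import Optional


def piece_metadata(shared: dict, image_url: Optional[str] = None) -> dict:
    """Copy the shared metadata; add original_image_url only for the image piece."""
    metadata = dict(shared)  # copy so the two pieces do not alias one dict
    if image_url is not None:
        metadata["original_image_url"] = image_url
    return metadata


# Illustrative values only; real rows come from the upstream metadata JSON.
shared = {"oversensitivity_type": "exaggerated_risk", "pid": 1, "harm_indices": [3]}
text_meta = piece_metadata(shared)
image_meta = piece_metadata(shared, image_url="https://example.invalid/1.png")
```

Copying the dict per piece means a later mutation on the image piece cannot leak into the text piece.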

Verification

  • 16 unit tests under tests/unit/datasets/test_mossbench_dataset.py — mocks network/image fetch, covers type filtering, metadata preservation, required-field validation, the no-SeedObjective convention, and rejection of invalid enum input.
  • Live smoke test fetched all 300 prompt groups (600 image+text pieces) from the pinned GitHub source. Filtering, metadata, and image caching all work end-to-end.
  • Pre-commit clean: ruff format, ruff check, ty.
  • Broader unit suite (504 tests) still passes.

Files

  • pyrit/datasets/seed_datasets/remote/mossbench_dataset.py — loader + MossBenchOversensitivityType enum
  • pyrit/datasets/seed_datasets/remote/__init__.py — registered (alphabetical)
  • tests/unit/datasets/test_mossbench_dataset.py — 16 tests
  • doc/references.bib — added @li2024mossbench
  • doc/bibliography.md — citation listed
  • doc/code/datasets/1_loading_datasets.py + .ipynb — mention added

Citation

@article{li2024mossbench,
  title={MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?},
  author={Li, Xirui and Zhou, Hengguang and Wang, Ruochen and Zhou, Tianyi and Cheng, Minhao and Hsieh, Cho-Jui},
  journal={arXiv preprint arXiv:2406.17806},
  year={2024}
}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz and others added 5 commits May 22, 2026 15:03
MOSSBench is a 300-example multimodal over-refusal benchmark across three
oversensitivity stimulus types (Exaggerated Risk, Negated Harm,
Counterintuitive Interpretation). A well-behaved VLM should answer every
query normally; refusing indicates over-sensitivity.

- Loader fetches the metadata JSON and per-pid PNG images from GitHub at a
  pinned commit, builds image+text SeedPrompt pairs sharing a
  prompt_group_id and sequence=0, and creates no SeedObjective
  (mirrors the XSTest/OR-Bench over-refusal convention).
- Exposes MossBenchOversensitivityType enum filter and a max_examples cap.
- Preserves the raw image attribute flags (human/child/syn/ocr) and the
  HarmBench-style numeric harm indices in per-seed metadata.
- Registers MOSSBench in the remote __init__, references.bib,
  bibliography.md, and the datasets reference notebook.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PyRIT's Sphinx config is MyST, not reST, so :class: / :func: roles render
as raw text. Match the convention used by other dataset loaders (plain
double-backtick code spans).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ax_dataset_size)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ataset-loader

# Conflicts:
#	doc/bibliography.md