
FEAT Add VLGuard multimodal safety dataset loader #1447

Open
romanlutz wants to merge 2 commits into Azure:main from romanlutz:romanlutz/vlguard-dataset

Conversation

@romanlutz
Contributor

Summary

Adds support for the VLGuard dataset (ICML 2024), a vision-language safety benchmark that evaluates whether multimodal models refuse unsafe content while remaining helpful on safe content.

What is VLGuard?

VLGuard contains ~2,000 image-instruction pairs across 4 categories (Privacy, Risky Behavior, Deception, Hateful Speech) and 8 subcategories (Personal Data, Professional Advice, Political, Sexually
Explicit, Violence, Disinformation, Discrimination by Sex, Discrimination by Race).

It supports three evaluation subsets:

  • unsafes — unsafe images with instructions (tests whether the model refuses to describe unsafe visual content)
  • safe_unsafes — safe images with unsafe instructions (tests whether the model refuses unsafe text prompts)
  • safe_safes — safe images with safe instructions (tests whether the model remains helpful)
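The three subsets above map naturally onto a string-valued enum. A minimal sketch of what `VLGuardSubset` might look like (the member values are assumptions based on the subset names listed here, not necessarily the loader's actual values):

```python
from enum import Enum


class VLGuardSubset(Enum):
    """The three VLGuard evaluation subsets (values are illustrative)."""

    UNSAFES = "unsafes"            # unsafe images with instructions
    SAFE_UNSAFES = "safe_unsafes"  # safe images with unsafe instructions
    SAFE_SAFES = "safe_safes"      # safe images with safe instructions
```

The string value can double as the split or file name when fetching from HuggingFace.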

Usage

from pyrit.datasets.seed_datasets.remote import VLGuardDataset, VLGuardCategory, VLGuardSubset

Load unsafe image examples (default)

loader = VLGuardDataset(token="hf...")
dataset = await loader.fetch_dataset()

Load safe images with unsafe instructions, filtered to Privacy category

loader = VLGuardDataset(
    subset=VLGuardSubset.SAFE_UNSAFES,
    categories=[VLGuardCategory.PRIVACY],
    token="hf...",
)
dataset = await loader.fetch_dataset()

Note: This is a gated dataset on HuggingFace. Users must accept the terms at https://huggingface.co/datasets/ys-zong/VLGuard and provide a HuggingFace token.
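Rather than hard-coding the token, it can be resolved from the conventional `HF_TOKEN` environment variable. A small sketch (the helper name `resolve_hf_token` is hypothetical, not part of this PR):

```python
import os


def resolve_hf_token(explicit: str = "") -> str:
    """Return an explicitly passed token, else fall back to HF_TOKEN."""
    token = explicit or os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError(
            "No HuggingFace token found. Accept the VLGuard terms at "
            "https://huggingface.co/datasets/ys-zong/VLGuard and set HF_TOKEN."
        )
    return token
```

Failing fast with a pointer to the terms page saves users a confusing 403 from the gated-dataset download later.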

Changes

  • pyrit/datasets/seed_datasets/remote/vlguard_dataset.py — new dataset loader
  • pyrit/datasets/seed_datasets/remote/__init__.py — register exports
  • tests/unit/datasets/test_vlguard_dataset.py — 14 unit tests
  • doc/code/datasets/1_loading_datasets.ipynb — regenerated to show VLGuard in dataset list

romanlutz and others added 2 commits March 10, 2026 05:33
Add support for the VLGuard dataset (ICML 2024) which contains image-instruction
pairs for evaluating vision-language model safety across 4 categories (Privacy,
Risky Behavior, Deception, Hateful Speech) with 8 subcategories.

Supports three evaluation subsets:
- unsafes: unsafe images with instructions (tests refusal)
- safe_unsafes: safe images with unsafe instructions (tests refusal)
- safe_safes: safe images with safe instructions (tests helpfulness)

Downloads from HuggingFace (gated dataset, requires token and terms acceptance).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz force-pushed the romanlutz/vlguard-dataset branch from cac3cad to 255dd50 on March 10, 2026 13:24


class VLGuardCategory(Enum):
"""Categories in the VLGuard dataset."""
Nit: Can we add a brief explainer or example of each category?
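One way to address this nit: pair each member with a short inline explainer. A sketch only — the member values and the example descriptions are assumptions, not the PR's actual code:

```python
from enum import Enum


class VLGuardCategory(Enum):
    """Categories in the VLGuard dataset (descriptions are illustrative)."""

    PRIVACY = "privacy"                # e.g. exposing personal data
    RISKY_BEHAVIOR = "risky_behavior"  # e.g. unqualified professional advice
    DECEPTION = "deception"            # e.g. disinformation, political manipulation
    HATEFUL_SPEECH = "hateful_speech"  # e.g. discrimination by sex or race
```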

models refuse unsafe content while remaining helpful on safe content.

The dataset covers 4 categories (Privacy, Risky Behavior, Deception, Hateful Speech)
with 8 subcategories (Personal Data, Professional Advice, Political, Sexually Explicit,
Why aren't subcategories in the enum? They don't have to be, just curious

prompts.append(text_prompt)
prompts.append(image_prompt)

if self.max_examples is not None and len(prompts) >= self.max_examples * 2:
Suggested change
-if self.max_examples is not None and len(prompts) >= self.max_examples * 2:
+# Note: each example contributes two entries to prompts (one text, one image), hence max_examples * 2.
+if self.max_examples is not None and len(prompts) >= self.max_examples * 2:
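The `* 2` exists because each dataset example appends two entries to `prompts` (one text, one image). A standalone sketch of that truncation logic, with hypothetical placeholder data:

```python
def collect_prompts(examples, max_examples=None):
    """Collect (text, image) prompt pairs, flattened into one list.

    Each example contributes two entries, so the list is capped at
    max_examples * 2 entries when a limit is set.
    """
    prompts = []
    for text_prompt, image_prompt in examples:
        prompts.append(text_prompt)
        prompts.append(image_prompt)
        if max_examples is not None and len(prompts) >= max_examples * 2:
            break
    return prompts


pairs = [(f"text-{i}", f"img-{i}") for i in range(5)]
print(len(collect_prompts(pairs, max_examples=2)))  # 4 entries = 2 examples
```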
