New UC Berkeley benchmark examines the trade-off between safety and usability in text-to-image models

Date: 2026-02-09

Text-to-image (T2I) models have become a central part of today’s artificial intelligence ecosystem, powering tools that generate images from short text descriptions. As these systems grow more capable and widely deployed, concerns about safety have followed closely behind. Developers have introduced layers of safeguards to prevent the creation of harmful or inappropriate images, ranging from violent content to privacy violations and copyright misuse.

While these protections are widely viewed as necessary, researchers are beginning to examine an unintended side effect: models that are so cautious they reject requests that are actually safe. This behavior, known as over-refusal, can limit how these tools are used in education, research and creative work.

A new study led by researchers at the University of California, Berkeley introduces the first large-scale benchmark designed to systematically evaluate over-refusal in text-to-image models. The paper, “OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models,” was accepted to the 2025 Conference on Neural Information Processing Systems (NeurIPS), one of the leading venues for machine learning research.

The lead author is Ziheng Cheng, a Ph.D. student in industrial engineering and operations research (IEOR) at UC Berkeley.

Balancing safety and usefulness

Text-to-image models typically rely on automated filters that screen user prompts for potentially risky content. These systems often focus on keywords associated with sensitive topics, such as violence, self-harm or illegal activity. According to the researchers, this approach can cause models to reject prompts that only appear risky on the surface, even when the underlying intent is benign.
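As a rough illustration of how surface-level keyword screening can misfire, consider the minimal Python sketch below. It is not the filter used by any model in the study; the keyword list, function name and example prompts are hypothetical and chosen only to show the failure mode.

```python
import re

# Hypothetical keyword list for illustration only; not taken from the paper
# or from any real model's safety filter.
SENSITIVE_KEYWORDS = {"blood", "weapon", "kill", "drug"}


def keyword_filter_refuses(prompt: str) -> bool:
    """Return True if the prompt contains any flagged keyword."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & SENSITIVE_KEYWORDS)


# A benign medical scene that still contains a flagged word.
benign_prompt = "A nurse drawing blood from a patient during a routine checkup"
# A prompt with no flagged words at all.
neutral_prompt = "A bowl of fruit on a wooden table"

print(keyword_filter_refuses(benign_prompt))   # refused despite benign intent
print(keyword_filter_refuses(neutral_prompt))  # allowed
```

The first prompt is refused because it matches a keyword, even though the scene it describes is harmless. That is the pattern the researchers call over-refusal.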

To study this phenomenon, the team created OVERT (OVEr-Refusal evaluation on Text-to-Image models), a dataset of 4,600 prompts that are carefully designed to be harmless but likely to trigger refusals. The prompts span nine safety-related categories, including privacy, discrimination, copyright, sexual content and violence. Examples include fictional or educational scenarios that contain sensitive terms but pose no real risk.

To put over-refusal in context, the researchers also constructed a companion dataset, OVERT-unsafe, consisting of 1,785 genuinely harmful prompts. Together, the two datasets allow researchers to examine how well models distinguish between content that should be blocked and content that is merely adjacent to sensitive topics.
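One way to picture how the two datasets work together is to compute a refusal rate on each. The Python sketch below assumes a simple record format and toy data; the `Result` class, the `refusal_rate` function and the example records are hypothetical and do not reflect the paper's evaluation code or its results.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One evaluated prompt (hypothetical record format)."""
    prompt: str
    is_harmful: bool  # True for an OVERT-unsafe-style prompt, False for a benign one
    refused: bool     # whether the model declined to generate an image


def refusal_rate(results: list[Result], harmful: bool) -> float:
    """Fraction of prompts with the given label that the model refused."""
    subset = [r for r in results if r.is_harmful == harmful]
    if not subset:
        return 0.0
    return sum(r.refused for r in subset) / len(subset)


# Toy data, invented for illustration only.
results = [
    Result("benign prompt 1", is_harmful=False, refused=True),
    Result("benign prompt 2", is_harmful=False, refused=False),
    Result("benign prompt 3", is_harmful=False, refused=False),
    Result("benign prompt 4", is_harmful=False, refused=False),
    Result("harmful prompt 1", is_harmful=True, refused=True),
    Result("harmful prompt 2", is_harmful=True, refused=True),
]

over_refusal_rate = refusal_rate(results, harmful=False)   # lower is better
unsafe_refusal_rate = refusal_rate(results, harmful=True)  # higher is better
print(over_refusal_rate, unsafe_refusal_rate)
```

A model that separates the two kinds of prompts well would score low on the first number and high on the second.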

Building a scalable benchmark

Rather than manually writing thousands of examples, the research team developed an automated pipeline that uses large language models to generate, filter and refine prompts at scale. Human auditing was incorporated to verify that the final prompts were policy-compliant and accurately labeled. The result is a benchmark that is both broad in scope and systematic in design.

The framework was also built with flexibility in mind. By adjusting prompt-generation instructions, the benchmark can be adapted to reflect different safety policies or interpretations of risk, an important consideration given the diversity of cultural and institutional norms around AI use.

What the results show

Using OVERT, the researchers evaluated several widely used text-to-image models. They found that over-refusal was common across models, though its severity varied by category and system design. Models that were more effective at blocking harmful prompts also tended to refuse benign prompts more frequently, highlighting a clear trade-off between safety and usability.

The team also explored whether rewriting prompts could reduce over-refusal without weakening safeguards. While rewriting sometimes lowered refusal rates, it often changed the original meaning of the prompt, limiting its usefulness as a general solution.

Looking ahead

The researchers emphasize that their work is not an argument for weaker safety controls. Instead, it provides a tool for more precisely measuring how safety mechanisms behave in practice.

“Strong guardrails are essential,” says Cheng, “but understanding when and why models refuse requests is critical to improving both safety and functionality.”

By making over-refusal measurable at scale, OVERT offers researchers and developers a new way to evaluate and refine text-to-image systems as they continue to shape how AI is used in society.