A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Xingjun Ma1,2, Yixu Wang1, Hengyuan Xu1, Yutao Wu3, Yifan Ding1, Yunhan Zhao1, Zilong Wang1, Jiabin Hua1, Ming Wen1,2, Jianan Liu1,2, Ranjie Duan, Yifeng Gao1, Yingshui Tan, Yunhao Chen1, Hui Xue, Xin Wang1, Wei Cheng, Jingjing Chen1, Zuxuan Wu1, Bo Li4, Yu-Gang Jiang1
1Fudan University, 2Shanghai Innovation Institute, 3Deakin University, 4UIUC
Leaderboard
The safety leaderboard provides a comparative view of frontier models across multiple dimensions, including benchmark performance, adversarial robustness, multilingual generalization, and regulatory compliance, spanning language, vision–language, and image generation settings. Overall, the results reveal a highly uneven safety landscape: while a small number of models achieve consistently strong and balanced performance across most evaluations, others exhibit clear trade-offs—performing well on standard benchmarks but degrading sharply under adversarial or cross-lingual conditions. Notably, strong benchmark scores do not necessarily translate into real-world robustness, highlighting the importance of multi-axis evaluation rather than single-score rankings.
- GPT-5.2 consistently leads across all four evaluation schemes, achieving top performance in Benchmark Evaluation (91.59%), Adversarial Robustness (54.26%), Multilingual Safety (77.50%), and Regulatory Compliance (90.22%). This uniformly strong showing indicates well-balanced and deeply integrated safety mechanisms that generalize effectively across modalities, languages, and attack settings.
- Gemini 3 Pro exhibits strong but uneven safety performance, ranking second in Benchmark Evaluation (88.06%) and Multilingual Safety (67.00%), and third in Compliance Evaluation (73.54%). However, its adversarial robustness drops noticeably to 41.17%, revealing sensitivity to attack-driven inputs despite solid baseline alignment.
- Qwen3-VL demonstrates a mixed safety profile, with competitive performance in Benchmark Evaluation (80.19%) and strong Regulatory Compliance (77.11%, second overall), but substantially weaker Adversarial Robustness (33.42%) and lower Multilingual Safety (64.00%). This pattern suggests that its safety mechanisms are more tightly coupled to compliance-oriented constraints than to adversarial or cross-lingual generalization.
- Grok 4.1 Fast ranks last or near-last across all dimensions, with relatively low scores in Benchmark Evaluation (66.60%), Adversarial Robustness (46.39%), Multilingual Safety (45.97%), and Regulatory Compliance (45.97%). The consistently weak performance highlights systemic deficiencies in its safety guardrails, particularly under adversarial and multilingual conditions.
- In the vision–language evaluations, GPT-5.2 again dominates both regimes, achieving near-saturated performance under adversarial evaluation (97.24%) and leading the benchmark setting (92.14%), indicating exceptional robustness against both standard and attack-driven safety risks.
- Qwen3-VL ranks second across both Benchmark (83.32%) and Adversarial (78.89%) evaluations, maintaining a consistent advantage over Gemini 3 Pro and demonstrating stable safety performance under adversarial pressure.
- Gemini 3 Pro places third, with solid but clearly lower scores of 82.53% on benchmarks and 75.44% under adversarial evaluation, reflecting moderate resilience but a noticeable gap relative to the top two models.
- Grok 4.1 Fast ranks fourth in both benchmark (67.97%) and adversarial (68.34%) evaluations, exhibiting a slight and somewhat counterintuitive score increase under adversarial conditions. This pattern suggests that its safety performance is largely insensitive to attack-driven perturbations, pointing to shallow guardrail behavior rather than safety generalization.
- Among the two text-to-image models, Nano Banana Pro leads across all three evaluation dimensions: Benchmark Evaluation (60.00%), Adversarial Evaluation (54.00%), and Regulatory Compliance (65.59%). Its comparatively modest drop from the benchmark to the adversarial setting, together with its strongest showing on compliance, suggests relatively robust and well-aligned safety controls that generalize beyond static prompt distributions, particularly in regulation-sensitive image generation scenarios.
- Seedream 4.5 trails on every dimension, with notably lower scores in Benchmark Evaluation (47.94%), Adversarial Evaluation (19.67%), and Regulatory Compliance (57.53%). While its compliance score recovers somewhat relative to the benchmark and adversarial settings, the overall performance indicates weaker baseline safeguards and limited robustness under adversarial T2I attacks.
Safety Profiles
The safety profile characterizes each model’s alignment behavior as a multidimensional pattern rather than a scalar score. By examining performance across benchmark, adversarial, multilingual, and compliance axes, distinct safety archetypes emerge, ranging from well-balanced generalists to rule-driven or guardrail-light models with pronounced weaknesses. These profiles show that safety failures often stem from structural design choices—such as reliance on rigid rules or surface-level filters—rather than isolated bugs. Taken together, the profiles underscore that model safety is inherently contextual and modality-dependent, reinforcing the need for holistic, profile-based assessment to understand real deployment risks.
- The Comprehensive Generalist (GPT-5.2). GPT-5.2 exhibits the most complete and balanced safety profile, with a radar chart approaching saturation across nearly all dimensions. Its performance remains consistently high from static benchmarks to jailbreak attacks and regulatory compliance. This stability suggests that safety constraints are internalized at a semantic and reasoning level rather than enforced through brittle pattern-based filters. As a result, GPT-5.2 is able to handle gray-area and context-rich queries with calibrated refusals, avoiding both over-refusal and jailbreak susceptibility.
- The Robust but Reactive Aligner (Gemini 3 Pro). Gemini 3 Pro demonstrates a strong but slightly contracted safety footprint relative to GPT-5.2. Its radar profile shows solid benchmark and multilingual performance, particularly in socially grounded tasks such as bias and toxicity detection. However, visible indentations along the adversarial and regulatory axes indicate a more reactive safety posture. Qualitative inspection suggests that Gemini 3 Pro often identifies harmful intent only after partial compliance (e.g., comply-then-warn behaviors) or relies on rigid refusal triggers. While effective against explicit harm, this strategy is less resilient to adversarial reframing and contextual manipulation.
- The Polarized Rule-Follower (Qwen3-VL). Qwen3-VL displays a sharply uneven, spiked safety spectrum. It excels in Regulatory Compliance and performs competitively in multilingual safety, even surpassing Gemini 3 Pro in certain governance-aligned dimensions. However, its adversarial robustness and social bias handling collapse markedly, producing a highly polarized profile. This pattern is indicative of a rule-centric alignment strategy: the model adheres strongly to explicit, codified constraints but struggles when safety requires semantic generalization or contextual inference. Consequently, Qwen3-VL is highly reliable within known regulatory boundaries, yet brittle under semantic disguise and novel attack strategies.
- The Guardrail-Light Instruction Follower (Grok 4.1 Fast). Grok 4.1 Fast shows the most uniformly diminished safety profile among language models, with consistently low scores across benchmark, adversarial, multilingual, and regulatory dimensions. It exhibits systemic safety deficiencies even under standard evaluation. The radar chart suggests minimal internalization of safety concepts and heavy reliance on lightweight or surface-level filtering, resulting in poor robustness across virtually all tested settings.
- The Divergent T2I Safety Strategies (Nano Banana Pro vs. Seedream 4.5). For the two T2I models, the radar charts reveal two contrasting alignment philosophies. Nano Banana Pro exhibits a sanitization-oriented profile, maintaining broader coverage across benchmark, adversarial, and compliance dimensions by implicitly transforming unsafe prompts into safer visual outputs. This strategy preserves utility while reducing harm. In contrast, Seedream 4.5 displays a block-or-leak profile: it relies on aggressive binary refusals but lacks robust semantic grounding for borderline cases, leading to severe failures when these coarse filters are bypassed. The divergence highlights a fundamental trade-off between generative flexibility and safety robustness in image generation systems.
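The profile view above can be made concrete by treating each model as a vector over the four evaluation axes and reporting its weakest dimension, which often predicts deployment risk better than an average. A minimal sketch using the LLM leaderboard scores quoted earlier in this report:

```python
# Safety profile as a vector over four axes (scores from the LLM leaderboard
# above). The weakest axis, not the mean, often dominates real-world risk.
AXES = ["benchmark", "adversarial", "multilingual", "compliance"]

profiles = {
    "GPT-5.2":       [91.59, 54.26, 77.50, 90.22],
    "Gemini 3 Pro":  [88.06, 41.17, 67.00, 73.54],
    "Qwen3-VL":      [80.19, 33.42, 64.00, 77.11],
    "Grok 4.1 Fast": [66.60, 46.39, 45.97, 45.97],
}

def weakest_axis(scores):
    """Return (axis name, score) for the lowest-scoring dimension."""
    i = min(range(len(scores)), key=scores.__getitem__)
    return AXES[i], scores[i]

for model, scores in profiles.items():
    axis, score = weakest_axis(scores)
    print(f"{model}: weakest on {axis} ({score}%)")
```

Note that every language model's weakest axis is adversarial robustness except Grok 4.1 Fast, whose profile is uniformly low, matching the "guardrail-light" characterization above.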
Large Language Model Benchmark Evaluation
In standard safety benchmarks, GPT-5.2 achieves the highest macro-average safe rate of 91.59%, closely followed by Gemini 3 Pro at 88.06%. While most models perform well on refusal-based tasks like StrongREJECT, there is significant variance in social reasoning; for instance, Qwen3-VL struggles severely on the BBQ bias benchmark with a score of only 45.00%. This indicates that while frontier models have improved at rejecting explicit harmful instructions, they remain structurally weak in detecting subtle social biases and ensuring fairness.
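The macro-average safe rate weights every benchmark equally, so a single weak benchmark (such as BBQ for Qwen3-VL) drags the aggregate down regardless of benchmark size. A minimal sketch of the aggregation; the benchmark names and rates below are illustrative placeholders, not the report's raw data:

```python
# Macro-average safe rate: unweighted mean of per-benchmark safe rates,
# so a small benchmark counts as much as a large one.
# Scores below are illustrative placeholders, not the report's raw data.
per_benchmark_safe_rate = {
    "StrongREJECT": 0.98,
    "BBQ": 0.45,          # a weak social-bias result pulls the mean down
    "OtherBench": 0.90,   # hypothetical benchmark name
}

def macro_average(rates):
    """Equal-weight mean over benchmarks."""
    return sum(rates.values()) / len(rates)

print(round(macro_average(per_benchmark_safe_rate), 4))
```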
Figure 3: Safe rate of five models across five benchmarks.
Figure 4: Examples of model responses to standard safety benchmark prompts.
LLM Adversarial Evaluation
Despite strong benchmark performance, models remain vulnerable to jailbreaks, with no model exceeding 85% worst-case safety. GPT-5.2 is the most robust (82.00% worst-case safety), whereas Doubao 1.8 and Grok 4.1 Fast exhibit catastrophic collapses, scoring only 12.00% and 4.00% respectively. The evaluation reveals that while static template-based attacks are often mitigated, models are highly susceptible to sophisticated, multi-turn agentic attacks (like X-Teaming) that progressively decompose harmful objectives.
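Worst-case safety is the minimum safe rate over all attack methods applied to a model, so one effective jailbreak family (e.g., multi-turn agentic attacks) dominates the score no matter how well the model resists the others. A minimal sketch; the attack names and rates are illustrative placeholders:

```python
# Worst-case safety: minimum safe rate across attack methods, so a single
# successful attack family determines the score.
# Attack names and rates below are illustrative placeholders.
safe_rate_by_attack = {
    "static_template": 0.95,
    "multi_turn_agentic": 0.82,  # e.g. X-Teaming-style goal decomposition
    "role_play": 0.88,
}

def worst_case_safety(rates):
    """Return (attack name, safe rate) for the weakest defense."""
    attack = min(rates, key=rates.get)
    return attack, rates[attack]

attack, rate = worst_case_safety(safe_rate_by_attack)
print(f"worst case: {attack} at {rate:.2%}")
```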
Figure 5: Adversarial evaluation results across five models.
Figure 6: Qualitative examples of successful jailbreak attempts.
Vision-Language Model Benchmark Evaluation
GPT-5.2 leads the multimodal benchmark with a 92.14% safe rate, showing consistent generalization, while models like Grok 4.1 Fast and Doubao 1.8 lag significantly behind. Performance is uneven across risk types; models handle cross-modal misalignment well (SIUO) but struggle with culturally grounded implicit harm in memes, where scores drop as low as 44%. A key failure mode is "analytical operationalization," where models provide dangerous, actionable details under the guise of neutral academic analysis when processing visual inputs.
Figure 11: Safety scores for Vision-Language Models on standard benchmarks.
Figure 12: Examples of VLM responses to unsafe visual and textual inputs.
VLM Adversarial Evaluation
In adversarial settings, GPT-5.2 maintains an exceptional safe rate of 97.24%, while other models suffer sharp degradation, particularly on the VLJailbreakBench where Qwen3-VL and Gemini 3 Pro drop to approximately 60%. Models frequently fail due to "refusal drift" in multi-turn dialogues or when harmful intent is obscured by complex formatting and role-play. Doubao 1.8 is particularly brittle, often collapsing to direct policy-override instructions.
Figure 13: Safe rates under multimodal adversarial evaluation.
Figure 14: Demonstration of visual jailbreaks triggering unsafe content.
Text-to-Image Benchmark Evaluation
On the T2ISafety benchmark, Nano Banana Pro achieves a safe rate of 52%, outperforming Seedream 4.5's 40%, though both struggle heavily with Violence and Disturbing content. Nano Banana Pro employs "implicit sanitization," modifying prompts to render safer images, whereas Seedream 4.5 relies on a "block-or-leak" strategy. When Seedream's filters are bypassed, it tends to generate highly toxic or abstractly disturbing "leakage" images rather than sanitized alternatives.
Figure 15: Evaluation results on the T2ISafety benchmark.
Figure 16: Samples of generated content evaluated for safety compliance.
T2I Adversarial Evaluation
Under advanced jailbreak attacks (PGJ and GenBreak), Nano Banana Pro demonstrates superior robustness with a worst-case safe rate of 54.00%, compared to Seedream 4.5's 19.67%. While Seedream 4.5 refuses more often, it generates highly toxic content when its filters fail, whereas Nano Banana Pro keeps toxicity scores lower even during failures. Both models exhibit specific weaknesses, such as "scale blindness" (missing small unsafe elements in image backgrounds) and susceptibility to artistic-style disguises for nudity content.
Figure 17: Adversarial evaluation results.
Figure 18: Visual results of adversarial prompts bypassing safety filters.
T2I Compliance Evaluation
Evaluated against regulation-grounded frameworks, Nano Banana Pro achieves a higher compliance rate (65.59%) than Seedream 4.5 (57.53%). While both models reliably suppress explicit visual taboos like nudity, they share a fundamental "blindness" to abstract regulatory violations such as Intellectual Property Infringement and Political Subversion. These failures occur because the models struggle to infer the semantic intent or legal context of a request solely from pixel-level patterns.
Figure 19: Quantitative results on the regulatory compliance benchmark.
Figure 20: Examples illustrating copyright infringement and identity leakage risks.
Cite this report:
@article{xsafe2026safety,
  title={A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5},
  author={Xingjun Ma and Yixu Wang and Hengyuan Xu and Yutao Wu and Yifan Ding and Yunhan Zhao and Zilong Wang and Jiabin Hua and Ming Wen and Jianan Liu and Ranjie Duan and Yifeng Gao and Yingshui Tan and Yunhao Chen and Hui Xue and Xin Wang and Wei Cheng and Jingjing Chen and Zuxuan Wu and Bo Li and Yu-Gang Jiang},
  journal={arXiv preprint arXiv:2601.10527},
  year={2026}
}