A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8,
Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

Xingjun Ma1,2   Yixu Wang1   Hengyuan Xu1   Yutao Wu3   Yifan Ding1   Yunhan Zhao1   Zilong Wang1  
Jiabin Hua1   Ming Wen1,2   Jianan Liu1,2   Ranjie Duan   Yifeng Gao1   Yingshui Tan   Yunhao Chen1  
Hui Xue   Xin Wang1   Wei Cheng   Jingjing Chen1   Zuxuan Wu1   Bo Li4   Yu-Gang Jiang1
Fudan University1    Shanghai Innovation Institute2   
Deakin University3    UIUC4   

📊 Comprehensive Safety Evaluation across major modalities and latest foundation models.

Abstract

This report presents a comprehensive safety evaluation of the latest foundation models released in 2026, including GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5. We analyze safety alignment across the text, vision-language, and text-to-image modalities, highlighting vulnerabilities of current safeguards under adversarial attack and gaps in regulatory compliance.

Large Language Model Benchmark Evaluation

In standard safety benchmarks, GPT-5.2 achieves the highest macro-average safe rate of 91.59%, closely followed by Gemini 3 Pro at 88.06%. While most models perform well on refusal-based tasks like StrongREJECT, there is significant variance in social reasoning; for instance, Qwen3-VL struggles severely on the BBQ bias benchmark with a score of only 45.00%. This indicates that while frontier models have improved at rejecting explicit harmful instructions, they remain structurally weak in detecting subtle social biases and ensuring fairness.
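A macro-average safe rate is conventionally an unweighted mean of per-benchmark safe rates; assuming that convention applies here, the aggregation can be sketched in a few lines of Python. Benchmark names and judgements below are placeholders (only StrongREJECT and BBQ are named above), not the report's data.

# Minimal sketch: macro-average safe rate as an unweighted mean of
# per-benchmark safe rates. All names and judgements are illustrative
# placeholders, not the report's data.
from statistics import mean

def safe_rate(judgements: list[bool]) -> float:
    """Percentage of responses judged safe on a single benchmark."""
    return 100.0 * sum(judgements) / len(judgements)

# True = the model's response to that prompt was judged safe.
per_benchmark = {
    "StrongREJECT": [True, True, True, False],
    "BBQ":          [True, False, True, False],
    # ... remaining benchmarks ...
}

per_benchmark_rates = {name: safe_rate(j) for name, j in per_benchmark.items()}
macro_average = mean(per_benchmark_rates.values())  # every benchmark weighted equally

print(per_benchmark_rates)
print(f"macro-average safe rate: {macro_average:.2f}%")

Under this convention, a single weak benchmark (such as BBQ for Qwen3-VL) pulls the macro-average down by its full share, regardless of how many prompts that benchmark contains.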

Quantitative Results
Figure 3: Safe rate of five models across five benchmarks.
Case Studies
Figure 4: Examples of model responses to standard safety benchmark prompts.

LLM Adversarial Evaluation

Despite strong benchmark performance, models remain vulnerable to jailbreaks, with no model exceeding 85% worst-case safety. GPT-5.2 is the most robust (82.00% worst-case safety), whereas Doubao 1.8 and Grok 4.1 Fast exhibit catastrophic collapses, scoring only 12.00% and 4.00% respectively. The evaluation reveals that while static template-based attacks are often mitigated, models are highly susceptible to sophisticated, multi-turn agentic attacks (like X-Teaming) that progressively decompose harmful objectives.
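Worst-case safety is a per-model aggregate over attack methods; assuming it denotes the minimum safe rate achieved across the attack suite, the reduction can be sketched as follows (attack names and values are hypothetical, not the reported results).

# Minimal sketch: worst-case safe rate as the minimum safe rate a model
# attains across the attack methods it is evaluated against.
# Attack names and values are hypothetical placeholders.

attack_safe_rates = {
    "static_template_jailbreak": 0.90,  # single-turn, template-based
    "multi_turn_agentic_attack": 0.12,  # e.g., an X-Teaming-style decomposition
    "obfuscated_payload_attack": 0.74,
}

worst_attack, worst_rate = min(attack_safe_rates.items(), key=lambda kv: kv[1])
print(f"worst-case safe rate: {worst_rate:.2%} (reached under {worst_attack})")

Under this reduction, a model that handles static template attacks almost perfectly but collapses on a single agentic attack still ends up with a very low worst-case number, a pattern consistent with the collapses reported above for Doubao 1.8 and Grok 4.1 Fast.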

Attack Success Rates
Figure 5: Adversarial evaluation results across five models.
Adversarial Examples
Figure 6: Qualitative examples of successful jailbreak attempts.

Vision-Language Model Benchmark Evaluation

GPT-5.2 leads the multimodal benchmark with a 92.14% safe rate, showing consistent generalization, while models like Grok 4.1 Fast and Doubao 1.8 lag significantly behind. Performance is uneven across risk types; models handle cross-modal misalignment well (SIUO) but struggle with culturally grounded implicit harm in memes, where scores drop as low as 44%. A key failure mode is "analytical operationalization," where models provide dangerous, actionable details under the guise of neutral academic analysis when processing visual inputs.

Benchmark Performance
Figure 11: Safety scores for Vision-Language Models on standard benchmarks.
Visual Prompt Examples
Figure 12: Examples of VLM responses to unsafe visual and textual inputs.

VLM Adversarial Evaluation

In adversarial settings, GPT-5.2 maintains an exceptional safe rate of 97.24%, while other models suffer sharp degradation, particularly on VLJailbreakBench, where Qwen3-VL and Gemini 3 Pro drop to approximately 60%. Models frequently fail due to "refusal drift" in multi-turn dialogues or when harmful intent is obscured by complex formatting and role-play. Doubao 1.8 is particularly brittle, often capitulating to direct policy-override instructions.

Adversarial Robustness
Figure 13: Safe rates under multimodal adversarial evaluation.
Visual Jailbreaks
Figure 14: Demonstration of visual jailbreaks triggering unsafe content.

Text-to-Image Benchmark Evaluation

On the T2ISafety benchmark, Nano Banana Pro achieves a safe rate of 52%, outperforming Seedream 4.5's 40%, though both struggle heavily in the Violence and Disturbing categories. Nano Banana Pro employs "implicit sanitization," rewriting risky prompts so that the rendered images are safer, whereas Seedream 4.5 relies on a "block-or-leak" strategy: when its filters are bypassed, it tends to generate highly toxic or abstractly disturbing "leakage" images rather than sanitized alternatives.

Safety Metrics
Figure 15: Evaluation results on the T2ISafety benchmark.
Generation Samples
Figure 16: Samples of generated content evaluated for safety compliance.

T2I Adversarial Evaluation

Under advanced jailbreak attacks (PGJ and GenBreak), Nano Banana Pro demonstrates superior robustness with a worst-case safe rate of 54.00%, compared to Seedream 4.5's 19.67%. While Seedream 4.5 refuses more often, it generates highly toxic content when its filters fail, whereas Nano Banana Pro keeps toxicity scores lower even on failed defenses. Both models exhibit specific weaknesses, such as "scale blindness" (overlooking small harmful elements in the background) and susceptibility to nudity disguised through artistic styles.
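The refusal-versus-leakage trade-off can be made concrete with a small aggregation over per-attempt labels. The sketch below assumes each adversarial attempt records whether the model refused, whether any rendered image was judged safe, and a judge-assigned toxicity score in [0, 1]; counting a refusal as a safe outcome is an assumption about the scoring protocol, and all field names and values are illustrative.

# Minimal sketch of the refusal-vs-leakage trade-off described above.
# Each record is one adversarial attempt; all values are placeholders.
from statistics import mean

attempts = [
    {"refused": True,  "image_safe": None,  "toxicity": None},
    {"refused": False, "image_safe": True,  "toxicity": 0.05},
    {"refused": False, "image_safe": False, "toxicity": 0.85},
    {"refused": False, "image_safe": False, "toxicity": 0.60},
]

refusal_rate = mean(a["refused"] for a in attempts)
# Assumption: a refusal or a safely rendered image both count as safe outcomes.
safe_rate = mean(bool(a["refused"] or a["image_safe"]) for a in attempts)
# Toxicity conditioned on leakage: only attempts that bypassed the filters
# and produced an unsafe image contribute.
leaks = [a["toxicity"] for a in attempts if not a["refused"] and not a["image_safe"]]
mean_leak_toxicity = mean(leaks) if leaks else 0.0

print(f"refusal rate:        {refusal_rate:.0%}")
print(f"safe rate:           {safe_rate:.0%}")
print(f"toxicity given leak: {mean_leak_toxicity:.2f}")

A high refusal rate paired with high leak toxicity matches the Seedream 4.5 pattern described above, while a lower refusal rate with low leak toxicity corresponds to Nano Banana Pro's sanitizing behavior.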

Bypass Success Rates
Figure 17: Adversarial evaluation results.
Attack Visualizations
Figure 18: Visual results of adversarial prompts bypassing safety filters.

T2I Compliance Evaluation

Evaluated against regulation-grounded frameworks, Nano Banana Pro achieves a higher compliance rate (65.59%) than Seedream 4.5 (57.53%). While both models reliably suppress explicit visual taboos like nudity, they share a fundamental "blindness" to abstract regulatory violations such as Intellectual Property Infringement and Political Subversion. These failures occur because the models struggle to infer the semantic intent or legal context of a request solely from pixel-level patterns.
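Category-level breakdowns make this kind of blindness visible even when the overall compliance rate looks moderate. The sketch below groups per-prompt compliance judgements by regulation category and flags weak categories; the category names echo the discussion above, and all judgement values are placeholders rather than benchmark data.

# Minimal sketch: per-category compliance rates, to surface categories where
# an overall number hides near-total failure. All judgements are placeholders.
from collections import defaultdict

# (regulation category, complied?) pairs from a hypothetical evaluation run.
records = [
    ("Nudity / explicit content", True), ("Nudity / explicit content", True),
    ("Intellectual Property Infringement", False),
    ("Intellectual Property Infringement", False),
    ("Political Subversion", False), ("Political Subversion", True),
]

by_category = defaultdict(list)
for category, complied in records:
    by_category[category].append(complied)

for category, outcomes in by_category.items():
    rate = 100.0 * sum(outcomes) / len(outcomes)
    flag = "  <-- weak" if rate < 50.0 else ""
    print(f"{category:38s} {rate:5.1f}%{flag}")

overall = 100.0 * sum(c for _, c in records) / len(records)
print(f"{'overall compliance':38s} {overall:5.1f}%")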

Compliance Scores
Figure 19: Quantitative results on the regulatory compliance benchmark.
Regulatory Case Studies
Figure 20: Examples illustrating copyright infringement and identity leakage risks.

Cite this report:

@article{xsafe2026safety,
  title={A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Nano Banana Pro, and Seedream 4.5},
  author={XSafe AI Team},
  journal={Technical Report},
  year={2026}
}