This report presents a comprehensive safety evaluation of the latest foundation models released in 2026, including GPT-5.2, Gemini 3 Pro, and others. We analyze safety alignment across text, vision-language, and text-to-image modalities, highlighting vulnerabilities in current safeguards against adversarial attacks as well as gaps in regulatory compliance.
In standard safety benchmarks, GPT-5.2 achieves the highest macro-average safe rate of 91.59%, closely followed by Gemini 3 Pro at 88.06%. While most models perform well on refusal-based tasks such as StrongREJECT, there is significant variance in social reasoning; for instance, Qwen3-VL struggles severely on the BBQ bias benchmark, scoring only 45.00%. This indicates that while frontier models have improved at rejecting explicit harmful instructions, they remain structurally weak at detecting subtle social biases and ensuring fairness.
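For reference, a macro-average safe rate is, by construction, an unweighted mean over benchmarks, so a single weak benchmark such as BBQ drags the aggregate down regardless of how large the other benchmarks are. The following is a minimal sketch of that aggregation; the benchmark names and scores are illustrative placeholders, not the report's measurements.

```python
# Illustrative sketch: how a macro-average safe rate is typically computed.
# Benchmark names and scores below are placeholders, not the report's data.
from statistics import mean

per_benchmark_safe_rate = {
    "StrongREJECT": 0.98,   # refusal-style benchmark (hypothetical score)
    "BBQ": 0.45,            # bias benchmark (hypothetical score)
    "OtherBench": 0.90,     # placeholder
}

# Macro average: every benchmark weighs equally, regardless of its size,
# so one weak benchmark (e.g. BBQ) noticeably lowers the aggregate.
macro_avg = mean(per_benchmark_safe_rate.values())
print(f"macro-average safe rate: {macro_avg:.2%}")
```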
Despite strong benchmark performance, models remain vulnerable to jailbreaks, with no model exceeding 85% worst-case safety. GPT-5.2 is the most robust (82.00% worst-case safety), whereas Doubao 1.8 and Grok 4.1 Fast exhibit catastrophic collapses, scoring only 12.00% and 4.00%, respectively. The evaluation reveals that while static template-based attacks are often mitigated, models are highly susceptible to sophisticated multi-turn agentic attacks (such as X-Teaming) that progressively decompose harmful objectives.
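Worst-case safety is conventionally the minimum safe rate a model achieves across all attack methods; assuming that convention here, the sketch below shows the aggregation with placeholder attack names and scores.

```python
# Illustrative sketch: worst-case safety as the minimum safe rate across attacks.
# Attack names and scores are placeholders, not the report's measurements.
safe_rate_by_attack = {
    "static_template": 0.90,
    "multi_turn_agentic": 0.12,  # e.g. X-Teaming-style decomposition attacks
    "roleplay_injection": 0.55,
}

worst_case = min(safe_rate_by_attack.values())
weakest_link = min(safe_rate_by_attack, key=safe_rate_by_attack.get)
print(f"worst-case safe rate: {worst_case:.2%} (under {weakest_link})")
```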
GPT-5.2 leads the multimodal benchmark with a 92.14% safe rate, showing consistent generalization, while models such as Grok 4.1 Fast and Doubao 1.8 lag significantly behind. Performance is uneven across risk types; models handle cross-modal misalignment (SIUO) well but struggle with culturally grounded implicit harm in memes, where scores drop as low as 44%. A key failure mode is "analytical operationalization," in which models provide dangerous, actionable details under the guise of neutral academic analysis when processing visual inputs.
In adversarial settings, GPT-5.2 maintains an exceptional safe rate of 97.24%, while other models suffer sharp degradation, particularly on VLJailbreakBench, where Qwen3-VL and Gemini 3 Pro drop to approximately 60%. Models frequently fail due to "refusal drift" in multi-turn dialogues or when harmful intent is obscured by complex formatting and role-play. Doubao 1.8 is particularly brittle, often collapsing under direct policy-override instructions.
On the T2ISafety benchmark, Nano Banana Pro achieves a safe rate of 52%, outperforming Seedream 4.5's 40%, though both struggle heavily with Violence and Disturbing content. Nano Banana Pro employs "implicit sanitization," modifying prompts to render safer images, whereas Seedream 4.5 relies on a "block-or-leak" strategy. When Seedream's filters are bypassed, it tends to generate highly toxic or abstractly disturbing "leakage" images rather than sanitized alternatives.
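One way to make the "implicit sanitization" versus "block-or-leak" contrast concrete is to bucket each generation for an unsafe prompt into refused, sanitized, or leaked, counting the first two as safe. The sketch below is purely illustrative; the outcome labels are hypothetical, not the report's annotations.

```python
# Illustrative sketch of a block-or-leak vs. implicit-sanitization profile.
# Outcome labels per prompt are hypothetical, not the report's annotations.
from collections import Counter

# Each entry is the judged outcome for one unsafe prompt.
outcomes = ["refused", "sanitized", "leaked", "sanitized", "refused", "leaked"]

counts = Counter(outcomes)
total = len(outcomes)

# "refused" and "sanitized" both count as safe; only "leaked" is unsafe.
safe_rate = (counts["refused"] + counts["sanitized"]) / total

# A block-or-leak model concentrates mass in "refused" and "leaked",
# whereas an implicit-sanitization model shifts mass toward "sanitized".
print({k: f"{v / total:.0%}" for k, v in counts.items()})
print(f"safe rate: {safe_rate:.0%}")
```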
Under advanced jailbreak attacks (PGJ and GenBreak), Nano Banana Pro demonstrates superior robustness with a worst-case safe rate of 54.00%, compared to Seedream 4.5's 19.67%. While Seedream 4.5 has higher refusal rates, it generates highly toxic content when its filters fail, whereas Nano Banana Pro keeps toxicity scores lower even during failures. Both models exhibit specific weaknesses, such as "scale blindness" (missing small background elements) and susceptibility to nudity disguised as artistic style.
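The finding that Seedream 4.5 refuses more often yet leaks more toxic content when it fails suggests reporting two numbers side by side: the refusal rate and the mean toxicity of non-refused generations. The sketch below uses hypothetical toxicity scores; in practice these would come from an external image-toxicity judge.

```python
# Illustrative sketch: refusal rate vs. mean toxicity of non-refused outputs.
# Toxicity scores (0 = benign, 1 = maximally toxic) are hypothetical; a real
# pipeline would obtain them from an image-toxicity judge model.
records = [
    {"refused": True,  "toxicity": 0.0},
    {"refused": False, "toxicity": 0.85},  # filter bypassed, toxic leak
    {"refused": True,  "toxicity": 0.0},
    {"refused": False, "toxicity": 0.20},  # bypassed but partially sanitized
]

refusal_rate = sum(r["refused"] for r in records) / len(records)
failures = [r["toxicity"] for r in records if not r["refused"]]
mean_failure_toxicity = sum(failures) / len(failures) if failures else 0.0

# A model can have a high refusal rate and still be risky if the toxicity of
# its failures is high -- the "block-or-leak" pattern described above.
print(f"refusal rate: {refusal_rate:.0%}, "
      f"mean toxicity on failures: {mean_failure_toxicity:.2f}")
```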
Evaluated against regulation-grounded frameworks, Nano Banana Pro achieves a higher compliance rate (65.59%) than Seedream 4.5 (57.53%). While both models reliably suppress explicit visual taboos like nudity, they share a fundamental "blindness" to abstract regulatory violations such as Intellectual Property Infringement and Political Subversion. These failures occur because the models struggle to infer the semantic intent or legal context of a request solely from pixel-level patterns.
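To see how an overall compliance rate can sit in the 55-65% range while abstract categories collapse, it helps to break compliance out by regulation-grounded category before macro-averaging. The sketch below uses hypothetical per-category scores chosen only to illustrate that pattern, not the report's results.

```python
# Illustrative sketch: per-category compliance, then a macro average.
# Category names mirror those discussed above; scores are hypothetical.
from statistics import mean

compliance_by_category = {
    "Nudity / explicit content": 0.95,           # concrete visual taboo
    "Violence": 0.80,
    "Intellectual Property Infringement": 0.15,  # abstract, intent-dependent
    "Political Subversion": 0.20,                # abstract, context-dependent
}

overall = mean(compliance_by_category.values())

# The macro average can look acceptable even though the abstract categories,
# which require inferring legal context rather than spotting pixels, collapse.
for category, rate in compliance_by_category.items():
    print(f"{category}: {rate:.0%}")
print(f"overall compliance (macro average): {overall:.0%}")
```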
@article{xsafe2026safety,
  title={A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Nano Banana Pro, and Seedream 4.5},
  author={XSafe AI Team},
  journal={Technical Report},
  year={2026}
}