Which AI model fits your Swiss use case?

10 models. 8 dimensions. 3 languages. 800 scenarios. Updated quarterly.

Last updated: Q2 2026 · Swiss-Bench v3.0

Overall Model Rankings

Swiss-Bench AI Model Rankings, Q2 2026 (10 models)
| Rank | Model | Type | HAAS | Status | Best At | Updated |
|---|---|---|---|---|---|---|
| 1 | Qwen 3.5 Plus | Open Source | 64.5 | Ready | Safety | Q2 2026 |
| 2 | Claude Sonnet 4 | Closed Source | 61.2 | Ready | Compliance | Q2 2026 |
| 3 | GLM 5 | Open Source | 60.4 | Ready | Reliability | Q2 2026 |
| 4 | GPT-oss 120B | Open Source | 58.5 | Evaluate | Security | Q2 2026 |
| 5 | Gemini 2.5 Flash | Closed Source | 58.0 | Evaluate | Documentation | Q2 2026 |
| 6 | GPT-4o | Closed Source | 57.3 | Evaluate | Robustness | Q2 2026 |
| 7 | MiniMax M2.5 | Open Source | 54.8 | Evaluate | Security | Q2 2026 |
| 8 | MiMo-V2-Flash | Open Source | 54.0 | Gap | Performance | Q2 2026 |
| 9 | Mistral Large 3 | Open Source | 50.5 | Gap | Swiss Languages | Q2 2026 |
| 10 | DeepSeek V3 | Open Source | 50.3 | Gap | Compliance | Q2 2026 |

HAAS Dimensions: D1 Performance (15%) · D2 Robustness (12%) · D3 Safety (10%) · D4 Compliance (15%) · D5 Swiss Language (8%) · D6 Documentation (2.5%) · D7 Production Reliability* (17.5%) · D8 Adversarial Security (20%)

*D7 scores are self-graded reliability proxies; see the SBP-003 paper for full benchmark-based scoring.
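
The dimension weights sum to 100%, which suggests the composite is a weighted sum of the eight dimension scores. The page does not state the aggregation formula, so the sketch below is an assumption rather than the official method. Under that assumption, the dimension scores from the HAAS Dimension Breakdown table reproduce the published composites for Qwen 3.5 Plus and Claude Sonnet 4. The names `WEIGHTS` and `haas_composite` are illustrative.

```python
# Dimension weights as published in the "HAAS Dimensions" line (they sum to 100%).
WEIGHTS = {
    "D1": 0.15, "D2": 0.12, "D3": 0.10, "D4": 0.15,
    "D5": 0.08, "D6": 0.025, "D7": 0.175, "D8": 0.20,
}


def haas_composite(scores: dict[str, float]) -> float:
    """Weighted sum of 0-100 dimension scores (assumed aggregation rule)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("expected exactly the dimensions D1..D8")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)


# Dimension scores copied from the HAAS Dimension Breakdown table.
qwen = {"D1": 51.5, "D2": 77.1, "D3": 33.3, "D4": 55.0,
        "D5": 100.0, "D6": 51.1, "D7": 94.4, "D8": 50.6}
claude = {"D1": 41.2, "D2": 88.4, "D3": 9.5, "D4": 80.1,
          "D5": 93.6, "D6": 35.2, "D7": 81.6, "D8": 44.2}

print(round(haas_composite(qwen), 1))    # 64.5
print(round(haas_composite(claude), 1))  # 61.2
```

Because D8 Adversarial Security (20%) and D7 Production Reliability (17.5%) carry the largest weights, a model with a perfect D5 Swiss Language score (8%) can still rank low overall.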

Each model is ranked by HAAS composite score and classified using percentile ranking: top 30% = Ready, middle 40% = Evaluate, bottom 30% = Gap.
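
As a minimal sketch of the percentile rule, the snippet below sorts models by composite score and assigns a status by rank position. How ties and non-round model counts are handled is not specified on this page, so the boundary logic here (rank position divided by model count, inclusive cut-offs) is an assumption, and `classify` is a hypothetical helper name.

```python
def classify(haas_by_model: dict[str, float]) -> dict[str, str]:
    """Label models Ready / Evaluate / Gap by rank percentile (assumed rule)."""
    ranked = sorted(haas_by_model, key=haas_by_model.get, reverse=True)
    n = len(ranked)
    labels = {}
    for rank, model in enumerate(ranked, start=1):
        # Integer arithmetic avoids floating-point issues at the 30% / 70% cut-offs.
        if rank * 100 <= 30 * n:
            labels[model] = "Ready"
        elif rank * 100 <= 70 * n:
            labels[model] = "Evaluate"
        else:
            labels[model] = "Gap"
    return labels


# HAAS composite scores copied from the Q2 2026 rankings table.
haas = {
    "Qwen 3.5 Plus": 64.5, "Claude Sonnet 4": 61.2, "GLM 5": 60.4,
    "GPT-oss 120B": 58.5, "Gemini 2.5 Flash": 58.0, "GPT-4o": 57.3,
    "MiniMax M2.5": 54.8, "MiMo-V2-Flash": 54.0, "Mistral Large 3": 50.5,
    "DeepSeek V3": 50.3,
}

labels = classify(haas)
for model, status in labels.items():
    print(f"{model}: {status}")
```

With 10 models this yields 3 Ready, 4 Evaluate, and 3 Gap, matching the Status column in the rankings table. Because the classification is relative, a model's status can change when other models are added or removed, even if its own score does not.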

Swiss-Bench v3.0: 800 scenarios across the Swiss Legal, FINMA Regulatory, and SFAO Audit domains. HAAS = Helvetic AI Assurance Score, a composite of 8 dimensions on a 0–100 scale. Classification is percentile-based across the 10 models. See the Methodology page for details.

Q2 2026 Highlights

Most Ready: Qwen 3.5 Plus
Highest HAAS composite score (64.5), computed over all 8 dimensions. Leads all models on D3 Safety (33.3).

Best Open Source: Qwen 3.5 Plus
Top open-weight model (HAAS 64.5). Viable for on-premise deployment with full data sovereignty.

Strongest Compliance: Claude Sonnet 4
Highest D4 Compliance score (80.1). Best fit for regulated environments that require audit-trail compliance.

Based on Swiss-Bench v3.0 (Q2 2026). 800 scenarios, 3-judge panel, structured scoring. Updated quarterly.

Dimension, Language & Domain Breakdowns

HAAS Dimension Breakdown

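| Model | D1 Performance | D2 Robustness | D3 Safety | D4 Compliance | D5 Language | D6 Documentation | D7 Reliability* | D8 Security |
|---|---|---|---|---|---|---|---|---|
| Qwen 3.5 Plus | 51.5 | 77.1 | 33.3 | 55.0 | 100.0 | 51.1 | 94.4 | 50.6 |
| Claude Sonnet 4 | 41.2 | 88.4 | 9.5 | 80.1 | 93.6 | 35.2 | 81.6 | 44.2 |
| GLM 5 | 44.2 | 76.5 | 13.5 | 68.1 | 92.2 | 42.5 | 90.9 | 43.2 |
| GPT-oss 120B | 31.5 | 78.9 | 2.4 | 72.8 | 93.1 | 16.8 | 75.3 | 60.7 |
| Gemini 2.5 Flash | 53.3 | 72.1 | 20.6 | 70.8 | 100.0 | 51.5 | 79.0 | 27.9 |
| GPT-4o | 19.2 | 91.9 | 11.1 | 63.8 | 74.9 | 31.3 | 85.5 | 55.1 |
| MiniMax M2.5 | 37.4 | 71.7 | 6.3 | 67.9 | 94.4 | 25.4 | 73.1 | 44.0 |
| MiMo-V2-Flash | 38.8 | 67.8 | 3.2 | 68.8 | 89.3 | 22.3 | 81.0 | 37.4 |
| Mistral Large 3 | 17.9 | 77.3 | 7.9 | 70.1 | 100.0 | 22.3 | 80.2 | 23.0 |
| DeepSeek V3 | 35.9 | 67.8 | 2.4 | 69.4 | 89.0 | 27.5 | 82.0 | 20.1 |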

Visual Comparison

[Chart: D1 to D8 dimension scores for each of the 10 models: Qwen 3.5 Plus, Claude Sonnet 4, GLM 5, GPT-oss 120B, Gemini 2.5 Flash, GPT-4o, MiniMax M2.5, MiMo-V2-Flash, Mistral Large 3, DeepSeek V3]

Per-Language Comparison

| Model | German (DE) | French (FR) | Italian (IT) |
|---|---|---|---|
| Qwen 3.5 Plus | 45.3% | 41.6% | 51.5% |
| Claude Sonnet 4 | 27.3% | 33.4% | 42.8% |
| GLM 5 | 34.3% | 33.1% | 42.8% |
| GPT-oss 120B | 16.0% | 19.6% | 28.9% |
| Gemini 2.5 Flash | 39.7% | 41.9% | 52.6% |
| GPT-4o | 16.0% | 25.0% | 33.5% |
| MiniMax M2.5 | 26.0% | 24.7% | 34.5% |
| MiMo-V2-Flash | 20.0% | 24.7% | 29.4% |
| Mistral Large 3 | 14.7% | 19.6% | 27.3% |
| DeepSeek V3 | 18.0% | 25.7% | 39.2% |

Per-Domain Comparison

| Model | Swiss Legal | FINMA | SFAO Audit |
|---|---|---|---|
| Qwen 3.5 Plus | 70.7% | 29.2% | 16.7% |
| Claude Sonnet 4 | 60.4% | 12.9% | 14.6% |
| GLM 5 | 62.1% | 16.9% | 14.6% |
| GPT-oss 120B | 42.6% | 4.8% | 1.0% |
| Gemini 2.5 Flash | 71.0% | 24.2% | 19.8% |
| GPT-4o | 44.7% | 8.7% | 5.2% |
| MiniMax M2.5 | 50.6% | 9.0% | 15.6% |
| MiMo-V2-Flash | 48.2% | 6.2% | 5.2% |
| Mistral Large 3 | 34.6% | 9.3% | 5.2% |
| DeepSeek V3 | 50.0% | 9.0% | 5.2% |

Swiss-Bench methodology and scoring criteria are documented on our Methodology page.

Our methodology, expert-verified ground truth, and statistical framework are described in our published research papers (Uenal, 2026a; Uenal, 2026b).

Need scores for your domain? Our AI Model Evaluation runs Swiss-Bench against your specific use case and includes a 5-model comparison, domain-specific scenarios, and an actionable recommendation.

Ready for an independent evaluation?

Start with an AI Model Evaluation or a full SOTA Model Sweep. Within two weeks you'll know which model works best for your Swiss use case.

Evaluation from CHF 8,000 · SOTA Sweep from CHF 20,000