Swiss-Bench
Which AI model fits your Swiss use case?
10 models. 6 dimensions. 3 languages. 395 scenarios. Updated quarterly.
Last updated: Q1 2026 · Swiss-Bench v2.0
Leaderboard
Overall Model Rankings
| Rank | Model | Type | HAAS | Status | Best At | Updated |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | Closed Source | 60.1 | Ready | Documentation | Q1 2026 |
| 2 | Qwen 3.5 Plus | Open Source | 59.4 | Ready | Safety | Q1 2026 |
| 3 | Claude Sonnet 4 | Closed Source | 58.3 | Ready | Compliance | Q1 2026 |
| 4 | GLM 5 | Open Source | 55.5 | Evaluate | Documentation | Q1 2026 |
| 5 | MiniMax M2.5 | Open Source | 50.2 | Evaluate | Swiss Languages | Q1 2026 |
| 6 | GPT-oss 120B | Open Source | 49.6 | Evaluate | Compliance | Q1 2026 |
| 7 | MiMo-V2-Flash | Open Source | 48.7 | Evaluate | Performance | Q1 2026 |
| 8 | DeepSeek V3 | Open Source | 48.4 | Gap | Compliance | Q1 2026 |
| 9 | GPT-4o | Closed Source | 48.2 | Gap | Robustness | Q1 2026 |
| 10 | Mistral Large 3 | Open Source | 47.4 | Gap | Swiss Languages | Q1 2026 |
Swiss-Bench v2.0: 395 scenarios across Swiss Legal, FINMA Regulatory, and SFAO Audit domains. HAAS = Helvetic AI Assurance Score (6 dimensions, 0–100). Status (Ready, Evaluate, Gap) is a percentile-based classification across the 10 models. Methodology →
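To make the idea of a percentile-based classification concrete, here is a minimal Python sketch that maps the HAAS scores from the rankings table onto the Status column. The cutoffs (70 and 30) and the names `percentile_rank` and `classify` are assumptions, chosen only because they happen to reproduce the published statuses; the actual rule is defined in the Methodology and may differ.

```python
# HAAS scores copied from the Overall Model Rankings table.
HAAS = {
    "Gemini 2.5 Flash": 60.1,
    "Qwen 3.5 Plus": 59.4,
    "Claude Sonnet 4": 58.3,
    "GLM 5": 55.5,
    "MiniMax M2.5": 50.2,
    "GPT-oss 120B": 49.6,
    "MiMo-V2-Flash": 48.7,
    "DeepSeek V3": 48.4,
    "GPT-4o": 48.2,
    "Mistral Large 3": 47.4,
}

# Hypothetical cutoffs, NOT taken from the Swiss-Bench methodology.
READY_CUTOFF = 70.0
GAP_CUTOFF = 30.0


def percentile_rank(score, all_scores):
    """Percentage of models scoring strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


def classify(scores):
    """Assign Ready / Evaluate / Gap from each model's percentile rank."""
    values = list(scores.values())
    status = {}
    for model, score in scores.items():
        p = percentile_rank(score, values)
        if p >= READY_CUTOFF:
            status[model] = "Ready"
        elif p < GAP_CUTOFF:
            status[model] = "Gap"
        else:
            status[model] = "Evaluate"
    return status


if __name__ == "__main__":
    for model, label in classify(HAAS).items():
        print(f"{model}: {label}")
```

Because the classification is relative, adding or removing a model can change another model's status even when its own HAAS score stays the same.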
Key Findings
Q1 2026 Highlights
Most Ready
Gemini 2.5 Flash
Highest overall HAAS score (60.1), aggregated across all 6 dimensions. Leads all models in D6 Documentation (51.5).
Best Open Source
Qwen 3.5 Plus
Top open-weight model (HAAS 59.4). Viable for on-premise deployment with full data sovereignty.
Strongest Compliance
Claude Sonnet 4
Highest D4 Compliance dimension score (80.1). Best fit for regulated environments requiring audit trail compliance.
Based on Swiss-Bench v2.0 (Q1 2026). 395 scenarios, 3-judge panel, structured scoring. Updated quarterly.
Detailed Results
Dimension, Language & Domain Breakdowns
HAAS Dimension Breakdown
| Model | D1 Perf. | D2 Robust. | D3 Safety | D4 Compl. | D5 Lang. | D6 Doc. |
|---|---|---|---|---|---|---|
| Gemini 2.5 Flash | 53.3 | 72.1 | 20.6 | 70.8 | 100.0 | 51.5 |
| Qwen 3.5 Plus | 51.5 | 77.1 | 33.3 | 55.0 | 100.0 | 51.1 |
| Claude Sonnet 4 | 41.2 | 88.4 | 9.5 | 80.1 | 93.6 | 35.2 |
| GLM 5 | 44.2 | 76.5 | 13.5 | 68.1 | 92.2 | 42.5 |
| MiniMax M2.5 | 37.4 | 71.7 | 6.3 | 67.9 | 94.4 | 25.4 |
| GPT-oss 120B | 31.5 | 78.9 | 2.4 | 72.8 | 93.1 | 16.8 |
| MiMo-V2-Flash | 38.8 | 67.8 | 3.2 | 68.8 | 89.3 | 22.3 |
| DeepSeek V3 | 35.9 | 67.8 | 2.4 | 69.4 | 89.0 | 27.5 |
| GPT-4o | 19.2 | 91.9 | 11.1 | 63.8 | 74.9 | 31.3 |
| Mistral Large 3 | 17.9 | 77.3 | 7.9 | 70.1 | 100.0 | 22.3 |
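For readers who want to check which model leads each dimension, the table above can be scanned with a short Python sketch. The data is copied from the table; the function name `dimension_leaders` is illustrative and not part of any Swiss-Bench tooling.

```python
# Column order matches the HAAS Dimension Breakdown table.
DIMENSIONS = ["D1 Perf.", "D2 Robust.", "D3 Safety",
              "D4 Compl.", "D5 Lang.", "D6 Doc."]

# Dimension scores copied from the table, one row per model.
SCORES = {
    "Gemini 2.5 Flash": [53.3, 72.1, 20.6, 70.8, 100.0, 51.5],
    "Qwen 3.5 Plus":    [51.5, 77.1, 33.3, 55.0, 100.0, 51.1],
    "Claude Sonnet 4":  [41.2, 88.4, 9.5, 80.1, 93.6, 35.2],
    "GLM 5":            [44.2, 76.5, 13.5, 68.1, 92.2, 42.5],
    "MiniMax M2.5":     [37.4, 71.7, 6.3, 67.9, 94.4, 25.4],
    "GPT-oss 120B":     [31.5, 78.9, 2.4, 72.8, 93.1, 16.8],
    "MiMo-V2-Flash":    [38.8, 67.8, 3.2, 68.8, 89.3, 22.3],
    "DeepSeek V3":      [35.9, 67.8, 2.4, 69.4, 89.0, 27.5],
    "GPT-4o":           [19.2, 91.9, 11.1, 63.8, 74.9, 31.3],
    "Mistral Large 3":  [17.9, 77.3, 7.9, 70.1, 100.0, 22.3],
}


def dimension_leaders(scores, dimensions):
    """Return {dimension: (top_score, [models sharing that score])}."""
    leaders = {}
    for i, dim in enumerate(dimensions):
        best = max(row[i] for row in scores.values())
        models = [m for m, row in scores.items() if row[i] == best]
        leaders[dim] = (best, models)
    return leaders


if __name__ == "__main__":
    for dim, (score, models) in dimension_leaders(SCORES, DIMENSIONS).items():
        print(f"{dim}: {score} ({', '.join(models)})")
```

Leaders are returned as lists because ties occur: three models share the top D5 score of 100.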
Visual Comparison
[Chart: visual comparison of the 10 models; legend: Gemini 2.5 Flash, Qwen 3.5 Plus, Claude Sonnet 4, GLM 5, MiniMax M2.5, GPT-oss 120B, MiMo-V2-Flash, DeepSeek V3, GPT-4o, Mistral Large 3]
Per-Language Comparison
| Model | German (DE) | French (FR) | Italian (IT) |
|---|---|---|---|
| Gemini 2.5 Flash | 39.7% | 41.9% | 52.6% |
| Qwen 3.5 Plus | 45.3% | 41.6% | 51.5% |
| Claude Sonnet 4 | 27.3% | 33.4% | 42.8% |
| GLM 5 | 34.3% | 33.1% | 42.8% |
| MiniMax M2.5 | 26.0% | 24.7% | 34.5% |
| GPT-oss 120B | 16.0% | 19.6% | 28.9% |
| MiMo-V2-Flash | 20.0% | 24.7% | 29.4% |
| DeepSeek V3 | 18.0% | 25.7% | 39.2% |
| GPT-4o | 16.0% | 25.0% | 33.5% |
| Mistral Large 3 | 14.7% | 19.6% | 27.3% |
Per-Domain Comparison
| Model | Swiss Legal | FINMA | SFAO Audit |
|---|---|---|---|
| Gemini 2.5 Flash | 71.0% | 24.2% | 19.8% |
| Qwen 3.5 Plus | 70.7% | 29.2% | 16.7% |
| Claude Sonnet 4 | 60.4% | 12.9% | 14.6% |
| GLM 5 | 62.1% | 16.9% | 14.6% |
| MiniMax M2.5 | 50.6% | 9.0% | 15.6% |
| GPT-oss 120B | 42.6% | 4.8% | 1.0% |
| MiMo-V2-Flash | 48.2% | 6.2% | 5.2% |
| DeepSeek V3 | 50.0% | 9.0% | 5.2% |
| GPT-4o | 44.7% | 8.7% | 5.2% |
| Mistral Large 3 | 34.6% | 9.3% | 5.2% |
Swiss-Bench methodology and scoring criteria are documented on our Methodology page →
The expert-verified ground truth and statistical framework behind it are described in our arXiv paper (Uenal, 2026).
Need scores for YOUR domain? Our AI Model Evaluation runs Swiss-Bench against your specific use case. 5-model comparison, domain-specific scenarios, actionable recommendation.
Contact
contact@ai-helvetic.ch
Ready for an independent evaluation?
Start with an AI Model Evaluation or a full SOTA Model Sweep. Within two weeks you'll know which model works best for your Swiss use case.