Swiss-Bench

Which AI model fits your Swiss use case?

Name: Swiss-Bench
Creator: Helvetic AI

10 models. 6 dimensions. 3 languages. 395 scenarios. Updated quarterly.

Last updated: Q1 2026 · Swiss-Bench v2.0

Leaderboard

Overall Model Rankings

Swiss-Bench AI Model Rankings, Q1 2026 (10 models)
Rank	Model	Type	HAAS	Status	Best At	Updated
1	Gemini 2.5 Flash	Closed Source	60.1	Ready	Documentation	Q1 2026
2	Qwen 3.5 Plus	Open Source	59.4	Ready	Safety	Q1 2026
3	Claude Sonnet 4	Closed Source	58.3	Ready	Compliance	Q1 2026
4	GLM 5	Open Source	55.5	Evaluate	Documentation	Q1 2026
5	MiniMax M2.5	Open Source	50.2	Evaluate	Swiss Languages	Q1 2026
6	GPT-oss 120B	Open Source	49.6	Evaluate	Compliance	Q1 2026
7	MiMo-V2-Flash	Open Source	48.7	Evaluate	Performance	Q1 2026
8	DeepSeek V3	Open Source	48.4	Gap	Compliance	Q1 2026
9	GPT-4o	Closed Source	48.2	Gap	Robustness	Q1 2026
10	Mistral Large 3	Open Source	47.4	Gap	Swiss Languages	Q1 2026

HAAS Dimensions: D1 Performance (25%) · D2 Robustness (20%) · D3 Safety (15%) · D4 Compliance (20%) · D5 Swiss Language (10%) · D6 Documentation (10%)
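
The weights above sum to 100%, so one plausible reading is that HAAS is a plain weighted sum of the six dimension scores. The sketch below works under that assumption (the Methodology page remains the authority on the actual formula). The names `HAAS_WEIGHTS` and `haas_composite` are hypothetical, and the sample scores are copied from the HAAS Dimension Breakdown table.

```python
# Dimension weights as listed on this page (fractions of 1.0).
HAAS_WEIGHTS = {
    "D1": 0.25,  # Performance
    "D2": 0.20,  # Robustness
    "D3": 0.15,  # Safety
    "D4": 0.20,  # Compliance
    "D5": 0.10,  # Swiss Language
    "D6": 0.10,  # Documentation
}


def haas_composite(scores: dict[str, float]) -> float:
    """Weighted sum of dimension scores (0-100 each).

    Assumption: HAAS is a simple weighted average; this is not
    confirmed by the page itself.
    """
    missing = set(HAAS_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(HAAS_WEIGHTS[dim] * scores[dim] for dim in HAAS_WEIGHTS)


# Dimension scores for two models, taken from the breakdown table.
gemini = {"D1": 53.3, "D2": 72.1, "D3": 20.6, "D4": 70.8, "D5": 100.0, "D6": 51.5}
claude = {"D1": 41.2, "D2": 88.4, "D3": 9.5, "D4": 80.1, "D5": 93.6, "D6": 35.2}

for name, scores in [("Gemini 2.5 Flash", gemini), ("Claude Sonnet 4", claude)]:
    print(f"{name}: {haas_composite(scores):.1f}")
```

Under this assumption the computed values land within rounding distance of the published HAAS scores (60.1 and 58.3), which suggests the weighted-sum reading is at least consistent with the leaderboard.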

Each model is ranked by HAAS composite score and classified using percentile ranking: top 30% = Ready, middle 40% = Evaluate, bottom 30% = Gap.
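
As an illustration of the percentile rule, here is a minimal sketch that assigns the three status labels from rank position. The function name `classify_by_percentile` is hypothetical, and the handling of model counts that do not divide evenly (band size rounded to the nearest whole model) is an assumption, since the page only shows the 10-model case.

```python
def classify_by_percentile(haas: dict[str, float]) -> dict[str, str]:
    """Label models Ready / Evaluate / Gap by rank on HAAS score.

    Top 30% -> Ready, bottom 30% -> Gap, everything between -> Evaluate.
    """
    ranked = sorted(haas, key=haas.get, reverse=True)
    n = len(ranked)
    # Size of the top and bottom bands: 30% of n, rounded to nearest integer.
    band = (30 * n + 50) // 100
    labels = {}
    for position, model in enumerate(ranked):
        if position < band:
            labels[model] = "Ready"
        elif position >= n - band:
            labels[model] = "Gap"
        else:
            labels[model] = "Evaluate"
    return labels


# HAAS scores from the Q1 2026 leaderboard.
HAAS_Q1_2026 = {
    "Gemini 2.5 Flash": 60.1, "Qwen 3.5 Plus": 59.4, "Claude Sonnet 4": 58.3,
    "GLM 5": 55.5, "MiniMax M2.5": 50.2, "GPT-oss 120B": 49.6,
    "MiMo-V2-Flash": 48.7, "DeepSeek V3": 48.4, "GPT-4o": 48.2,
    "Mistral Large 3": 47.4,
}

for model, status in classify_by_percentile(HAAS_Q1_2026).items():
    print(f"{model}: {status}")
```

With 10 models the bands contain 3, 4, and 3 models, matching the Status column in the leaderboard. Because the labels are relative, a model's status can change when models are added or removed even if its own score does not.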

Swiss-Bench v2.0 covers 395 scenarios across three domains: Swiss Legal, FINMA Regulatory, and SFAO Audit. HAAS (Helvetic AI Assurance Score) combines 6 dimensions into a single 0–100 score. Status is assigned by percentile rank across the 10 models. Methodology →

Key Findings

Q1 2026 Highlights

Most Ready

Gemini 2.5 Flash

Highest HAAS composite score (60.1), weighted across all 6 dimensions. Leads all models in D6 Documentation (51.5).

Best Open Source

Qwen 3.5 Plus

Top open-weight model (HAAS 59.4). Viable for on-premise deployment with full data sovereignty.

Strongest Compliance

Claude Sonnet 4

Highest D4 Compliance dimension score (80.1). Best fit for regulated environments requiring audit trail compliance.

Based on Swiss-Bench v2.0 (Q1 2026). 395 scenarios, 3-judge panel, structured scoring. Updated quarterly.

Detailed Results

Dimension, Language & Domain Breakdowns

HAAS Dimension Breakdown

Model	D1 Performance	D2 Robustness	D3 Safety	D4 Compliance	D5 Swiss Language	D6 Documentation
Gemini 2.5 Flash	53.3	72.1	20.6	70.8	100.0	51.5
Qwen 3.5 Plus	51.5	77.1	33.3	55.0	100.0	51.1
Claude Sonnet 4	41.2	88.4	9.5	80.1	93.6	35.2
GLM 5	44.2	76.5	13.5	68.1	92.2	42.5
MiniMax M2.5	37.4	71.7	6.3	67.9	94.4	25.4
GPT-oss 120B	31.5	78.9	2.4	72.8	93.1	16.8
MiMo-V2-Flash	38.8	67.8	3.2	68.8	89.3	22.3
DeepSeek V3	35.9	67.8	2.4	69.4	89.0	27.5
GPT-4o	19.2	91.9	11.1	63.8	74.9	31.3
Mistral Large 3	17.9	77.3	7.9	70.1	100.0	22.3
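
To read the table by column rather than by row, the following sketch finds the highest score in each dimension and the model or models that hold it. The names `SCORES`, `DIMENSIONS`, and `dimension_leaders` are hypothetical. Ties are reported together, and this is not necessarily how the leaderboard's "Best At" column is derived.

```python
# Dimension scores copied from the table above, in D1..D6 order.
DIMENSIONS = ["D1", "D2", "D3", "D4", "D5", "D6"]
SCORES = {
    "Gemini 2.5 Flash": [53.3, 72.1, 20.6, 70.8, 100.0, 51.5],
    "Qwen 3.5 Plus":    [51.5, 77.1, 33.3, 55.0, 100.0, 51.1],
    "Claude Sonnet 4":  [41.2, 88.4, 9.5, 80.1, 93.6, 35.2],
    "GLM 5":            [44.2, 76.5, 13.5, 68.1, 92.2, 42.5],
    "MiniMax M2.5":     [37.4, 71.7, 6.3, 67.9, 94.4, 25.4],
    "GPT-oss 120B":     [31.5, 78.9, 2.4, 72.8, 93.1, 16.8],
    "MiMo-V2-Flash":    [38.8, 67.8, 3.2, 68.8, 89.3, 22.3],
    "DeepSeek V3":      [35.9, 67.8, 2.4, 69.4, 89.0, 27.5],
    "GPT-4o":           [19.2, 91.9, 11.1, 63.8, 74.9, 31.3],
    "Mistral Large 3":  [17.9, 77.3, 7.9, 70.1, 100.0, 22.3],
}


def dimension_leaders(scores: dict[str, list[float]]) -> dict[str, tuple[float, list[str]]]:
    """Return, per dimension, the top score and every model that reaches it."""
    leaders = {}
    for i, dim in enumerate(DIMENSIONS):
        best = max(row[i] for row in scores.values())
        holders = [model for model, row in scores.items() if row[i] == best]
        leaders[dim] = (best, holders)
    return leaders


for dim, (best, models) in dimension_leaders(SCORES).items():
    print(f"{dim}: {best} ({', '.join(models)})")
```

On the Q1 2026 data this shows a three-way tie at 100 in D5 Swiss Language and that no model exceeds 33.3 in D3 Safety, which is worth keeping in mind when comparing composite scores.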

Visual Comparison

Charts: D1 to D6 dimension profile for each of the 10 models, using the values in the HAAS Dimension Breakdown table.

Per-Language Comparison

Model	German (DE)	French (FR)	Italian (IT)
Gemini 2.5 Flash	39.7%	41.9%	52.6%
Qwen 3.5 Plus	45.3%	41.6%	51.5%
Claude Sonnet 4	27.3%	33.4%	42.8%
GLM 5	34.3%	33.1%	42.8%
MiniMax M2.5	26.0%	24.7%	34.5%
GPT-oss 120B	16.0%	19.6%	28.9%
MiMo-V2-Flash	20.0%	24.7%	29.4%
DeepSeek V3	18.0%	25.7%	39.2%
GPT-4o	16.0%	25.0%	33.5%
Mistral Large 3	14.7%	19.6%	27.3%

Per-Domain Comparison

Model	Swiss Legal	FINMA	SFAO Audit
Gemini 2.5 Flash	71.0%	24.2%	19.8%
Qwen 3.5 Plus	70.7%	29.2%	16.7%
Claude Sonnet 4	60.4%	12.9%	14.6%
GLM 5	62.1%	16.9%	14.6%
MiniMax M2.5	50.6%	9.0%	15.6%
GPT-oss 120B	42.6%	4.8%	1.0%
MiMo-V2-Flash	48.2%	6.2%	5.2%
DeepSeek V3	50.0%	9.0%	5.2%
GPT-4o	44.7%	8.7%	5.2%
Mistral Large 3	34.6%	9.3%	5.2%

Get the full Swiss-Bench breakdown

HAAS dimension scores, per-language and per-domain comparisons with traffic-light classifications for all 10 models.


Swiss-Bench methodology and scoring criteria are documented on our Methodology page →

Our methodology, expert-verified ground truth, and statistical framework are described in our published arXiv paper (Uenal, 2026).

Need scores for YOUR domain? Our AI Model Evaluation runs Swiss-Bench against your specific use case. It includes a 5-model comparison, domain-specific scenarios, and an actionable recommendation.

Contact

Ready for an independent evaluation?

Start with an AI Model Evaluation or a full SOTA Model Sweep. Within two weeks you'll know which model works best for your Swiss use case.

Evaluation from CHF 8,000 · SOTA Sweep from CHF 20,000

contact@ai-helvetic.ch