Swiss-Bench

Which AI model fits your Swiss use case?

Name: Swiss-Bench
Creator: Helvetic AI

10 models. 8 dimensions. 3 languages. 800 scenarios. Updated quarterly.

Last updated: Q2 2026 · Swiss-Bench v3.0

Leaderboard

Overall Model Rankings

Swiss-Bench AI Model Rankings, Q2 2026 (10 models)
Rank	Model	Type	HAAS	Status	Best At	Updated
1	Qwen 3.5 Plus	Open Source	64.5	Ready	Safety	Q2 2026
2	Claude Sonnet 4	Closed Source	61.2	Ready	Compliance	Q2 2026
3	GLM 5	Open Source	60.4	Ready	Reliability	Q2 2026
4	GPT-oss 120B	Open Source	58.5	Evaluate	Security	Q2 2026
5	Gemini 2.5 Flash	Closed Source	58.0	Evaluate	Documentation	Q2 2026
6	GPT-4o	Closed Source	57.3	Evaluate	Robustness	Q2 2026
7	MiniMax M2.5	Open Source	54.8	Evaluate	Security	Q2 2026
8	MiMo-V2-Flash	Open Source	54.0	Gap	Performance	Q2 2026
9	Mistral Large 3	Open Source	50.5	Gap	Swiss Language	Q2 2026
10	DeepSeek V3	Open Source	50.3	Gap	Compliance	Q2 2026

HAAS Dimensions: D1 Performance (15%) · D2 Robustness (12%) · D3 Safety (10%) · D4 Compliance (15%) · D5 Swiss Language (8%) · D6 Documentation (2.5%) · D7 Production Reliability* (17.5%) · D8 Adversarial Security (20%)

*D7 scores are self-graded reliability proxies; see the SBP-003 paper for full benchmark-based scoring.
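As a rough illustration of how the dimension weights combine, the sketch below assumes HAAS is a plain weighted average of the eight dimension scores. That assumption reproduces the published composites for Qwen 3.5 Plus and Claude Sonnet 4 from their dimension scores, but the authoritative formula is the one on the Methodology page. The function name `haas_composite` and the dictionary layout are illustrative only.

```python
# Dimension weights in percent, copied from the HAAS Dimensions line (sum = 100).
WEIGHTS = {
    "D1": 15.0, "D2": 12.0, "D3": 10.0, "D4": 15.0,
    "D5": 8.0, "D6": 2.5, "D7": 17.5, "D8": 20.0,
}


def haas_composite(scores):
    """Weighted average of the eight dimension scores (each 0-100).

    Assumption: HAAS is a simple weighted mean; this helper is hypothetical
    and not taken from the Swiss-Bench code base.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight


# Dimension scores from the HAAS Dimension Breakdown table.
qwen = {"D1": 51.5, "D2": 77.1, "D3": 33.3, "D4": 55.0,
        "D5": 100.0, "D6": 51.1, "D7": 94.4, "D8": 50.6}
claude = {"D1": 41.2, "D2": 88.4, "D3": 9.5, "D4": 80.1,
          "D5": 93.6, "D6": 35.2, "D7": 81.6, "D8": 44.2}

print(round(haas_composite(qwen), 1))    # 64.5, matching the leaderboard
print(round(haas_composite(claude), 1))  # 61.2, matching the leaderboard
```

Because D8 and D7 together carry 37.5% of the weight, a model with modest Performance can still rank highly if it is reliable and secure.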

Each model is ranked by HAAS composite score and classified using percentile ranking: top 30% = Ready, middle 40% = Evaluate, bottom 30% = Gap.
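The percentile rule can be sketched as follows. This is a minimal illustration, assuming the cut-offs are applied to rank positions (3 / 4 / 3 for 10 models); how ties or model counts that do not divide evenly are handled is not stated here, so the rounding below is an assumption. The name `classify` is hypothetical.

```python
def classify(haas_by_model, ready_share=0.30, gap_share=0.30):
    """Label models Ready / Evaluate / Gap by rank position.

    Assumption: shares are converted to counts with round(); the real
    tie-breaking and rounding rules may differ.
    """
    ranked = sorted(haas_by_model.items(), key=lambda kv: kv[1], reverse=True)
    n = len(ranked)
    n_ready = round(n * ready_share)
    n_gap = round(n * gap_share)
    labels = {}
    for position, (model, _score) in enumerate(ranked):
        if position < n_ready:
            labels[model] = "Ready"
        elif position >= n - n_gap:
            labels[model] = "Gap"
        else:
            labels[model] = "Evaluate"
    return labels


# HAAS composites from the Q2 2026 leaderboard.
HAAS_Q2_2026 = {
    "Qwen 3.5 Plus": 64.5, "Claude Sonnet 4": 61.2, "GLM 5": 60.4,
    "GPT-oss 120B": 58.5, "Gemini 2.5 Flash": 58.0, "GPT-4o": 57.3,
    "MiniMax M2.5": 54.8, "MiMo-V2-Flash": 54.0, "Mistral Large 3": 50.5,
    "DeepSeek V3": 50.3,
}

labels = classify(HAAS_Q2_2026)
for model, label in labels.items():
    print(f"{model}: {label}")
```

Note that the labels are relative: a model's status depends on the other models in the cohort, not on a fixed HAAS threshold.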

Swiss-Bench v3.0 comprises 800 scenarios across three domains: Swiss Legal, FINMA Regulatory, and SFAO Audit. HAAS (Helvetic AI Assurance Score) is a composite of 8 dimensions on a 0–100 scale. Classification is percentile-based across the 10 models.

Key Findings

Q2 2026 Highlights

Most Ready

Qwen 3.5 Plus

Highest HAAS composite score (64.5), a weighted aggregate of all 8 dimensions. Also the highest D3 Safety score of any model (33.3).

Best Open Source

Qwen 3.5 Plus

Top open-weight model (HAAS 64.5). Viable for on-premises deployment with full data sovereignty.

Strongest Compliance

Claude Sonnet 4

Highest D4 Compliance score of any model (80.1). Best fit for regulated environments that require audit-trail compliance.

Based on Swiss-Bench v3.0 (Q2 2026). 800 scenarios, 3-judge panel, structured scoring. Updated quarterly.

Detailed Results

Dimension, Language & Domain Breakdowns

HAAS Dimension Breakdown

Model	D1 Performance	D2 Robustness	D3 Safety	D4 Compliance	D5 Swiss Language	D6 Documentation	D7 Production Reliability*	D8 Adversarial Security
Qwen 3.5 Plus	51.5	77.1	33.3	55.0	100.0	51.1	94.4	50.6
Claude Sonnet 4	41.2	88.4	9.5	80.1	93.6	35.2	81.6	44.2
GLM 5	44.2	76.5	13.5	68.1	92.2	42.5	90.9	43.2
GPT-oss 120B	31.5	78.9	2.4	72.8	93.1	16.8	75.3	60.7
Gemini 2.5 Flash	53.3	72.1	20.6	70.8	100.0	51.5	79.0	27.9
GPT-4o	19.2	91.9	11.1	63.8	74.9	31.3	85.5	55.1
MiniMax M2.5	37.4	71.7	6.3	67.9	94.4	25.4	73.1	44.0
MiMo-V2-Flash	38.8	67.8	3.2	68.8	89.3	22.3	81.0	37.4
Mistral Large 3	17.9	77.3	7.9	70.1	100.0	22.3	80.2	23.0
DeepSeek V3	35.9	67.8	2.4	69.4	89.0	27.5	82.0	20.1

Visual Comparison

[Figure: one chart per model plotting dimensions D1–D8, in leaderboard order: Qwen 3.5 Plus, Claude Sonnet 4, GLM 5, GPT-oss 120B, Gemini 2.5 Flash, GPT-4o, MiniMax M2.5, MiMo-V2-Flash, Mistral Large 3, DeepSeek V3.]

Per-Language Comparison

Model	German (DE)	French (FR)	Italian (IT)
Qwen 3.5 Plus	45.3%	41.6%	51.5%
Claude Sonnet 4	27.3%	33.4%	42.8%
GLM 5	34.3%	33.1%	42.8%
GPT-oss 120B	16.0%	19.6%	28.9%
Gemini 2.5 Flash	39.7%	41.9%	52.6%
GPT-4o	16.0%	25.0%	33.5%
MiniMax M2.5	26.0%	24.7%	34.5%
MiMo-V2-Flash	20.0%	24.7%	29.4%
Mistral Large 3	14.7%	19.6%	27.3%
DeepSeek V3	18.0%	25.7%	39.2%

Per-Domain Comparison

Model	Swiss Legal	FINMA	SFAO Audit
Qwen 3.5 Plus	70.7%	29.2%	16.7%
Claude Sonnet 4	60.4%	12.9%	14.6%
GLM 5	62.1%	16.9%	14.6%
GPT-oss 120B	42.6%	4.8%	1.0%
Gemini 2.5 Flash	71.0%	24.2%	19.8%
GPT-4o	44.7%	8.7%	5.2%
MiniMax M2.5	50.6%	9.0%	15.6%
MiMo-V2-Flash	48.2%	6.2%	5.2%
Mistral Large 3	34.6%	9.3%	5.2%
DeepSeek V3	50.0%	9.0%	5.2%

Swiss-Bench methodology and scoring criteria are documented on our Methodology page.

Our methodology, expert-verified ground truth, and statistical framework are described in our published research papers (Uenal, 2026a; Uenal, 2026b).

Need scores for YOUR domain? Our AI Model Evaluation runs Swiss-Bench against your specific use case. 5-model comparison, domain-specific scenarios, actionable recommendation.

Contact

Ready for an independent evaluation?

Start with an AI Model Evaluation or a full SOTA Model Sweep. Within two weeks you'll know which model works best for your Swiss use case.

Evaluation from CHF 8,000 · SOTA Sweep from CHF 20,000