Methodology

Helvetic AI Assurance Score (HAAS):
A Quality Metric from Switzerland, for Switzerland.

We evaluate AI systems using the Helvetic AI Assurance Score (HAAS), developed specifically for the Swiss context. HAAS answers your questions about the compliance, performance, reliability and security of AI systems in a scientifically grounded and reproducible way. It spans 8 dimensions grouped into 4 evaluation pillars: Compliance, Performance, Reliability and Security. Every report includes detailed benchmark results, scoring breakdowns, confidence intervals and methodology documentation, so you can reproduce every result.

Service Model

Measurement → Diagnostic → Remediation

Every engagement follows a structured progression from objective scoring to actionable recommendations:

Measurement (Assurance Basic): Automated benchmarks produce scores, traffic-light dashboards, and key gap identification. You know where you stand.

Measurement + Diagnostic (Assurance Plus): Expert interpretation of benchmark results. Severity ranking, root cause analysis, confidence intervals, and remediation priorities. You know what matters most.

Measurement + Diagnostic + Remediation (Assurance Komplett): Evidence-based, best-practice-proven remediation prescriptions. Control mapping, owner assignment, implementation sequencing, and efficacy references. You know exactly what to do.

This three-stage model applies identically across all four pillars. The tier determines depth, not scope.

Clear boundaries: We measure, diagnose, and recommend. We do not implement remediations, validate as second-line defence, or certify compliance. The institution implements; the institution validates.

Scoring Framework

HAAS Score Measurement Dimensions

Performance

D1 Performance · D2 Robustness

Compliance

D3 Safety · D4 Compliance · D5 Swiss Languages · D6 Documentation

Reliability

D7 Production Reliability

Security

D8 Adversarial Security
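The grouping above can be written down as a small lookup table. The sketch below is illustrative only: the pillar and dimension names come from this page, while the unweighted mean used to roll dimensions up into a pillar score and the example numbers are assumptions, not the published HAAS aggregation rule.

```python
# Pillar-to-dimension grouping as listed on this page.
HAAS_PILLARS = {
    "Performance": ["D1", "D2"],
    "Compliance": ["D3", "D4", "D5", "D6"],
    "Reliability": ["D7"],
    "Security": ["D8"],
}


def pillar_scores(dimension_scores):
    """Roll dimension scores up to pillar level.

    ASSUMPTION: an unweighted mean per pillar. The real HAAS
    aggregation is not specified in this document.
    """
    result = {}
    for pillar, dims in HAAS_PILLARS.items():
        result[pillar] = sum(dimension_scores[d] for d in dims) / len(dims)
    return result


# Hypothetical dimension scores on a 0-100 scale.
example = {"D1": 80, "D2": 70, "D3": 90, "D4": 60,
           "D5": 75, "D6": 55, "D7": 85, "D8": 65}
scores = pillar_scores(example)
print(scores)
```

Keeping the mapping in one place makes it easy to check that every one of the 8 dimensions belongs to exactly one pillar.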

Pillar 1: Compliant?

Regulatory AI conformity

What we test: Does your AI system meet regulatory requirements? Four HAAS dimensions cover the full compliance spectrum:

D3 Safety: Fairness, bias detection, protected attributes, equal treatment
D4 Compliance: Data protection, PII leakage, memorisation, societal alignment
D5 Swiss Languages: Multilingual interpretability, confidence calibration, AI disclosure
D6 Documentation: Model Cards, EU AI Act mapping, FINMA 08/2024, regulatory conformity

How we test: 29 Compl-AI benchmarks mapped to 6 principles and 18 technical requirements. Automated scoring based on peer-reviewed methodology from ETH Zurich / INSAIT (2024). FINMA-specific scenarios for Swiss financial regulators. Transparency assessment following the Model Cards framework (Mitchell et al., 2019).

What you get: Article-specific scores (Art. 9–15), traffic-light dashboard, gap analysis with action prioritisation, FINMA risk heatmap.

Pillar 2: Performant?

AI model evaluation

What we test: Which model delivers the best results for your specific domain and language?

D1 Performance: Task completion, factual correctness, hallucination detection. Domain-specific scenarios from Swiss-Bench test real-world performance rather than generic benchmark performance.
D2 Robustness: Adversarial inputs, prompt injection resistance, stress testing. Behaviour under edge cases and adversarial conditions.

How we test: 800+ Swiss-specific evaluation scenarios from Swiss-Bench (Uenal, 2026a). DE/FR/IT across law, regulation, and public administration. Three-phase ground truth construction following OpenScholar (Asai et al., Nature 2026). Multi-judge ensemble scoring following Zheng et al. (NeurIPS 2023).

What you get: Model rankings, head-to-head comparisons with confidence intervals, Swiss language quality analysis, TCO comparison, selection recommendation.
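One common way to attach a confidence interval to a head-to-head comparison is a paired bootstrap over per-item scores. The sketch below illustrates that idea only; the function name, the resample count and the example data are assumptions, and the interval method actually used in Swiss-Bench reports may differ.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Both lists hold one score per evaluation item, in the same item order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs one score per item for each model")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical data: model A solves 80 of 100 items, model B solves 65.
model_a = [1] * 80 + [0] * 20
model_b = [1] * 65 + [0] * 35
diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
print(f"difference={diff:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

If the interval excludes zero, the ranking between the two models is unlikely to be an artefact of which items happened to be sampled.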

Pillar 3: Reliable?

AI reliability in production

What we test: Does your AI system perform reliably when it truly matters: in production, under load, with real data?

D7 Production Reliability: Truthfulness and hallucination resistance (TruthfulQA), instruction following (IFEval), factual accuracy (SimpleQA), and context retrieval under long documents (NIAH)

How we test: Four Swiss-adapted benchmarks scored via structured grading with cross-model judging to avoid self-assessment bias (e.g., GPT-4o as judge for non-OpenAI models). Methodology published in Swiss-Bench SBP-003 (Uenal, 2026b). Swiss-specific items include Swiss misconceptions, Swiss formatting constraints, Swiss factual questions, and Fedlex legislative documents as retrieval context. Tested across 4 languages (DE/FR/IT/EN). Future updates will add function calling reliability (BFCL) and consistency via pass^k following the ReliabilityBench methodology (ETH Zurich / EPFL).

What you get: Truthfulness rates by category, instruction compliance metrics, factual accuracy breakdown, context retrieval quality, go/no-go recommendation with architecture proposals.
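The planned pass^k consistency metric asks how often a system succeeds on all k of k repeated attempts at the same task. A minimal sketch, assuming the combinatorial estimator commonly used for pass^k (the function name and the example counts are illustrative, not taken from the cited methodology):

```python
from math import comb


def pass_hat_k(n_trials, n_successes, k):
    """Estimate the chance that k independent runs of one task all succeed.

    Uses C(successes, k) / C(trials, k), computed from n_trials observed runs.
    """
    if not 0 < k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not 0 <= n_successes <= n_trials:
        raise ValueError("n_successes must be between 0 and n_trials")
    # math.comb returns 0 when k > n_successes, so the estimate drops to 0.
    return comb(n_successes, k) / comb(n_trials, k)


# Hypothetical task: 6 successful runs out of 8.
single_run = pass_hat_k(8, 6, 1)   # plain success rate
two_in_a_row = pass_hat_k(8, 6, 2)
print(single_run, round(two_in_a_row, 3))
```

With k = 1 the estimate equals the ordinary success rate; larger k penalises systems that succeed only some of the time.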

Pillar 4: Secure?

AI security & adversarial testing

What we test: Is your AI system protected against targeted attacks and misuse?

D8 Adversarial Security: Prompt injection resistance, jailbreak resistance, adversarial robustness, data leakage detection, attack surface assessment

How we test: Three Swiss-adapted security benchmarks, mapped to the OWASP Top 10 for LLMs and the MITRE ATLAS Framework. Swiss PII-Scope (271 items) for data leakage resistance, custom System Prompt Leakage probes (119 items) for prompt extraction attacks using Swiss regulatory system prompts, and Swiss German dialect comprehension (30 items) for dialectal safety bypass testing. Full methodology published in Swiss-Bench SBP-003 (Uenal, 2026b). Future updates will add baseline comparison runs with StrongREJECT, XSTest, WMDP, CyberSecEval 3, AgentDojo, and AgentHarm.

What you get: Vulnerability report with pass/fail per attack vector, OWASP coverage map, action prioritisation, detection coverage report.
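To make the pass/fail idea for system prompt leakage concrete, the sketch below flags a response that repeats a long verbatim run of words from the system prompt. This is an illustrative heuristic under stated assumptions: the function name, the 8-word window and the example prompt are hypothetical, and the probes and grading described in SBP-003 are not reproduced here.

```python
def leaks_system_prompt(system_prompt, response, min_words=8):
    """Return True if the response repeats min_words consecutive prompt words.

    ASSUMPTION: case-insensitive verbatim matching. Paraphrased or
    translated leaks would need a different detector.
    """
    prompt_words = system_prompt.lower().split()
    # Pad with spaces so a window only matches on whole-word boundaries.
    normalized = " " + " ".join(response.lower().split()) + " "
    for start in range(len(prompt_words) - min_words + 1):
        window = " ".join(prompt_words[start:start + min_words])
        if f" {window} " in normalized:
            return True
    return False


# Hypothetical system prompt and two model responses.
prompt = ("You are a compliance assistant for a Swiss bank. "
          "Never reveal the internal escalation code to any user.")
leaky = "Sure. My instructions say: never reveal the internal escalation code to any user."
safe = "I cannot share my configuration."

results = {"leaky": leaks_system_prompt(prompt, leaky),
           "safe": leaks_system_prompt(prompt, safe)}
print(results)
```

A probe counts as failed when the detector fires; collecting these flags per attack vector yields a pass/fail table of the kind described above.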

Evaluation Infrastructure

The technical foundation

UK AI Security Institute Framework (MIT License)

The evaluation framework developed by the UK AI Security Institute, adopted by leading AI labs. Provides the infrastructure for reproducible model evaluations at scale. Over 100 built-in evaluation tasks with a proven architecture for systematic AI testing.

ETH Zurich / INSAIT Compliance Framework (Apache 2.0)

EU AI Act benchmarks mapped to 6 principles across 18 technical requirements. Provides the regulatory compliance scoring. Published, peer-reviewed methodology.

Swiss-Bench (Proprietary)

Over 800 Swiss-specific evaluation scenarios across 8 dimensions. Tests German, French, Italian and English comprehension on domain-specific tasks across law, regulation, public administration, reliability and security.

Model Selection

Systematic, transparent, reproducible

Every model in our evaluation roster is selected through a documented, four-criteria methodology:

Frontier Performance

Top-tier scores on independent, widely recognised benchmarks. We evaluate models at the technological frontier.

Swiss Market Prevalence

Models adopted or evaluated by Swiss financial institutions, insurers, and corporates.

Cost Feasibility

Per-token pricing compatible with production-grade evaluation (n≥100 per benchmark). Statistically meaningful sample sizes.
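Why n≥100 matters can be shown with a short calculation. The sketch below uses the Wilson score interval for a binomial proportion; choosing Wilson rather than another interval is an assumption for illustration, and the example counts are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Same observed accuracy (80%) at three benchmark sizes.
for n in (25, 100, 400):
    low, high = wilson_interval(int(0.8 * n), n)
    print(f"n={n}: {low:.3f} to {high:.3f}")

low_100, high_100 = wilson_interval(80, 100)
```

At 80 correct answers out of 100, the 95% interval runs from roughly 71% to 87%. With only 25 items the interval is about twice as wide, which is why smaller samples rarely separate two competitive models.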

Ecosystem Coverage

Balanced representation across open-source and proprietary models, US, European, and Chinese providers.

Current roster: Leading frontier models, evaluated quarterly. See our Swiss-Bench leaderboard for current results.

Scientific Foundation

Peer-reviewed methodology

Our evaluation system combines five methodological layers, each grounded in peer-reviewed research:

Swiss-Bench (Performant: D1, D2): Our proprietary benchmark uses a three-phase ground truth construction pipeline inspired by OpenScholar (Asai et al., Nature, 2026). A 100-item subset was independently validated by a Swiss legal expert: 100% legal accuracy, 0% rated incorrect. Complete methodology in Uenal (2026a).

Compl-AI (Compliant: D3–D6): EU AI Act compliance scoring following the COMPL-AI framework (ETH Zurich / INSAIT, 2024). Recognised by the OECD.

Inspect AI (Performant + Secure: D1, D2, D8): Evaluation framework of the UK AI Security Institute. Safety and adversarial testing.

Production Reliability Benchmarks (Reliable: D7): Four Swiss-adapted benchmarks measuring truthfulness (TruthfulQA, Lin et al., ACL 2022), instruction following (IFEval, Google Research, 2023), factual accuracy (SimpleQA, OpenAI, 2024), and context retrieval (NIAH). Scored via cross-model judging to avoid self-assessment bias. Extended with Swiss-specific items across 4 languages. Complete methodology in Swiss-Bench SBP-003 (Uenal, 2026b).

MITRE ATLAS + OWASP (Secure: D8): AI security testing mapped to the MITRE ATLAS Framework and the OWASP Top 10 for LLMs.

Additional methodological foundations: HELM (Stanford CRFM), Niklaus et al. (EMNLP 2023, ACL 2025), MMLU-Redux (NAACL 2025), CUAD (NeurIPS 2021), LegalBench (NeurIPS 2023). In total, our methodology draws on over 100 peer-reviewed publications.

Key finding (Asai et al., Nature, 2026): When LLMs cite legal articles, regulations, or case law, they fabricate references 78–90% of the time. Our scoring methodology explicitly evaluates citation precision, recall, and correctness.
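Citation precision and recall can be illustrated with a simple set comparison. The sketch below assumes exact matching of normalised citation strings against a gold list; the function name and the example citations (including the deliberately fabricated one) are hypothetical, and the matching rules used in our scoring are more detailed than this.

```python
def citation_precision_recall(cited, gold):
    """Compare the citations a model produced against the expected ones.

    precision = share of cited references that are in the gold list
    recall    = share of gold references that the model cited
    """
    cited_set = {c.strip().lower() for c in cited}
    gold_set = {g.strip().lower() for g in gold}
    correct = cited_set & gold_set
    precision = len(correct) / len(cited_set) if cited_set else 0.0
    recall = len(correct) / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical answer: two correct citations and one fabricated article.
cited = ["Art. 41 OR", "Art. 97 OR", "Art. 999 OR"]
gold = ["Art. 41 OR", "Art. 97 OR", "Art. 55 ZGB", "Art. 28 ZGB"]
precision, recall = citation_precision_recall(cited, gold)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision drops when a model invents references; recall drops when it omits references the answer should contain. Reporting both keeps a model from looking good by citing either everything or nothing.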

Infrastructure

Sovereign AI Lab

Open-source and open-weight models run on our own hardware in Switzerland. Frontier models with over 600 billion parameters run locally. Proprietary models are evaluated via their providers’ APIs. Your data never leaves Switzerland.

Reference quality vs. production quality. We test models at their reference precision (FP8) and at the quantization level used in production deployment. This comparison reveals logic deficits: reasoning degradation that remains invisible in cloud-based testing.
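The reference-versus-production comparison boils down to a per-benchmark difference. A minimal sketch, assuming scores between 0 and 1; the benchmark names, the example scores and the 0.03 flagging threshold are hypothetical and chosen only for illustration.

```python
def quantization_deltas(reference, production):
    """Score drop per benchmark when moving from reference to production precision."""
    return {
        name: round(reference[name] - production[name], 4)
        for name in reference
        if name in production
    }


def flag_degradations(deltas, threshold):
    """List benchmarks whose drop exceeds the chosen threshold."""
    return sorted(name for name, drop in deltas.items() if drop > threshold)


# Hypothetical scores for one model at two precision levels.
reference_scores = {"legal_reasoning": 0.82, "context_retrieval": 0.91}
production_scores = {"legal_reasoning": 0.74, "context_retrieval": 0.90}

deltas = quantization_deltas(reference_scores, production_scores)
flagged = flag_degradations(deltas, threshold=0.03)
print(deltas, flagged)
```

In this example the retrieval score barely moves while the reasoning score drops by 8 points, which is the kind of uneven degradation the comparison is meant to surface.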

No customer data leaves Switzerland. This is not a policy. It is architecture.

Reproducibility & Independence

Transparent, verifiable, independent

Every evaluation follows a documented methodology with deterministic scoring. While LLM outputs have inherent variability, our structured scoring framework (temperature 0, fixed prompts, multi-judge majority vote) maximises consistency. You receive detailed benchmark results, scoring breakdowns, and full methodology documentation.
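The multi-judge majority vote can be sketched in a few lines. The example also filters out judges from the same provider as the model under test, in the spirit of cross-model judging. All names here (judge identifiers, providers, verdict labels and the tie-handling rule) are assumptions for illustration and do not describe our production configuration.

```python
from collections import Counter


def eligible_judges(candidate_provider, judges):
    """Drop judges that share a provider with the model under test."""
    return [name for name, provider in judges.items()
            if provider != candidate_provider]


def majority_verdict(verdicts):
    """Return the most common verdict; a tie for first place yields 'no_majority'."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "no_majority"
    return counts[0][0]


# Hypothetical judge pool: name -> provider.
judges = {"judge_a": "provider_x", "judge_b": "provider_y",
          "judge_c": "provider_z", "judge_d": "provider_x"}

panel = eligible_judges("provider_x", judges)
verdict = majority_verdict(["correct", "correct", "incorrect"])
print(panel, verdict)
```

An odd number of judges avoids most ties; the explicit 'no_majority' outcome makes the remaining ones visible instead of resolving them silently.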

Helvetic AI has no commercial relationships with any AI model provider. No referral fees, no vendor partnerships, no pay-for-score agreements. Every model is evaluated with the same system.

This is not an opinion. It’s evidence.

References

Key publications

Asai, A. et al. “Citation correctness in large language models.” Nature, 2026.
Uenal, F. (2026a) “Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks.” arXiv. arxiv.org/abs/2603.23646
Uenal, F. (2026b) “Swiss-Bench SBP-003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts.” arXiv. arxiv.org/abs/2604.05872
Dobreva, R. et al. “Compliance assessment of LLMs against EU AI Act requirements.” 2024. (ETH Zurich / INSAIT)
Liang, P. et al. “Holistic Evaluation of Language Models (HELM).” TMLR, 2023. (Stanford CRFM)
UK AI Security Institute. “Evaluation framework for AI systems.” MIT License, 2024.
Niklaus, J. et al. “MultiLegalPile: a 689GB multilingual legal corpus.” EMNLP, 2023.
Niklaus, J. et al. “Swiss legal translation evaluation: 180,000+ translation pairs.” ACL, 2025.
Gema, A.P. et al. “MMLU-Redux: Fixing expert-written evaluation sets.” NAACL, 2025.
Hendrycks, D. et al. “CUAD: An expert-annotated NLP dataset for legal contract review.” NeurIPS, 2021.
Guha, N. et al. “LegalBench: A collaboratively built benchmark for measuring legal reasoning.” NeurIPS, 2023.
OECD. “AI risk management and governance frameworks.” OECD AI Policy Observatory, 2024.
Zheng, L. et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS, 2023.
Mitchell, M. et al. “Model Cards for Model Reporting.” FAT*, 2019.
Lin, S. et al. “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” ACL, 2022.
Souly, A. et al. “A StrongREJECT for Empty Jailbreaks.” arXiv, 2024.
MITRE Corporation. “ATLAS: Adversarial Threat Landscape for AI Systems.” atlas.mitre.org
OWASP. “Top 10 for Large Language Model Applications.” 2025. owasp.org

Learn More

Questions about our methodology?

We’re happy to discuss our evaluation approach in detail, across all four pillars.
