The HAAS Score: 6 dimensions, transparent methodology
Every AI system we evaluate receives a Helvetic AI Assurance Score (HAAS) across 6 dimensions. Each dimension is scored 0–100 with confidence intervals. Our methodology is fully documented and every result is independently verifiable.
6 evaluation dimensions
D1: Performance (25%)
How accurate is the model on real Swiss tasks? Task completion, factual correctness, hallucination detection. Domain-specific scenarios from Swiss-Bench test real-world performance, not generic benchmarks.
D2: Robustness (20%)
Does the model hold up under pressure? Adversarial inputs, prompt injection resistance, stress testing. How does the model perform under edge cases and adversarial conditions?
D3: Safety (15%)
Can you trust what the model says? Hallucination detection, fabricated citation identification, harmful output avoidance. Tests whether models invent Swiss legal references or produce misleading regulatory guidance.
D4: Compliance (20%)
Does the model meet EU AI Act requirements? Technical compliance across applicable articles and technical requirements. Automated scoring built on peer-reviewed methodology from ETH Zurich.
D5: Swiss Language (10%)
Does the model handle DE/FR/IT correctly? Multilingual competence across German, French, and Italian. Language-specific accuracy and Swiss translation quality.
D6: Documentation (10%)
Is the model transparent and well-documented? Model card completeness and explanation quality following the Model Cards framework (Mitchell et al., 2019), evaluated using a structured checklist aligned with EU AI Act Art. 11 (completeness 60%, quality 40%).
Built on world-class evaluation science
UK AI Security Institute
Our evaluation infrastructure builds on the framework developed by the UK AI Security Institute and adopted by leading AI labs including Anthropic, Google DeepMind, and xAI.* Provides reproducible model evaluations at scale with over 100 built-in evaluation tasks.
ETH Zurich / INSAIT
EU AI Act compliance scoring built on peer-reviewed methodology from ETH Zurich, mapping regulatory principles to technical requirements. Recognised by the OECD.
Swiss-Bench
395 proprietary Swiss-specific evaluation scenarios in German, French, and Italian. Tests domain knowledge, multilingual competence, and regulatory understanding across Swiss legal, financial, and administrative contexts.
Systematic, transparent, reproducible
Every model in our evaluation roster is selected through a documented, four-criteria methodology:
Frontier Performance
Top-tier scores on independent, widely recognised benchmarks. We evaluate models that compete at the frontier, not legacy systems.
Swiss Market Prevalence
Models adopted or considered by Swiss financial institutions, insurers, and corporates. We evaluate what your organisation is likely to deploy.
Cost Feasibility
Per-token pricing compatible with statistically meaningful evaluation. We run rigorous sample sizes, not toy demos.
Ecosystem Coverage
Balanced representation across major providers, open-source and closed-source, US, European, and Chinese models.
Peer-reviewed methodology
Our evaluation system combines three methodological layers, each grounded in peer-reviewed research:
Swiss-Bench (D1 Performance): Our proprietary benchmark uses a three-phase ground truth construction pipeline inspired by the data curation methodology of OpenScholar (Asai et al., Nature, 2026): expert drafting from primary statutory sources, adversarial verification, and quality-gated filtering. A 100-item subset was independently validated by a Swiss legal expert (MLaw, University of Fribourg), achieving 100% Legal Accuracy and 0% rated Incorrect. Model responses are scored by a blind three-judge LLM panel across three dimensions (Legal Accuracy, Citation Accuracy, Completeness) with majority-vote aggregation, following the multi-judge ensemble methodology of Zheng et al. (NeurIPS, 2023). The complete Swiss-Bench methodology is documented in our published ArXiv paper (Uenal, 2026).
Compl-AI (D2 Robustness, D4 Compliance): EU AI Act compliance scoring adapts the COMPL-AI framework (ETH Zurich / INSAIT, 2024), mapping regulatory principles to technical requirements. Recognized by the OECD.
Inspect AI (D3 Safety): Safety and adversarial testing builds on the UK AI Security Institute evaluation framework, adopted by leading AI labs.
Documentation (D6): Transparency assessment follows the Model Cards framework (Mitchell et al., 2019), operationalized as a structured checklist aligned with EU AI Act Article 11 technical documentation requirements.
Our holistic evaluation philosophy follows HELM (Stanford CRFM, peer-reviewed in TMLR). Swiss legal translation evaluation (D5) builds on methodology validated by Niklaus et al. (EMNLP 2023, ACL 2025) covering 180,000+ Swiss legal translation pairs. Related benchmarking work includes MMLU-Redux (Gema et al., NAACL 2025), CUAD (Hendrycks et al., NeurIPS 2021), and LegalBench (Guha et al., NeurIPS 2023). In total, our methodology draws on 100+ peer-reviewed publications.
Transparent and verifiable
Every evaluation follows a documented methodology with deterministic scoring. While LLM outputs have inherent variability, our structured scoring framework (temperature 0, fixed prompts, multi-judge majority vote) maximises consistency. You receive detailed benchmark results, scoring breakdowns, and full methodology documentation, sufficient to verify and understand every finding.
This is not an opinion. It’s evidence.
No conflicts of interest
Helvetic AI has no commercial relationships with any AI model provider. No referral fees, no vendor partnerships, no pay-for-score agreements. Every model is evaluated with the same system, the same benchmarks, and the same scoring methodology.
Key publications
- Asai, A. et al. “Citation correctness in large language models.” Nature, 2026.
- Uenal, F. “Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks.” ArXiv, 2026.
- Dobreva, R. et al. “Compliance assessment of LLMs against EU AI Act requirements.” ETH Zürich / INSAIT, 2024.
- Liang, P. et al. “Holistic Evaluation of Language Models (HELM).” TMLR, 2023. (Stanford CRFM)
- UK AI Security Institute. “Evaluation framework for AI systems.” 2024.
- Niklaus, J. et al. “MultiLegalPile: a 689GB multilingual legal corpus.” EMNLP, 2023.
- Niklaus, J. et al. “Swiss legal translation evaluation: 180,000+ translation pairs.” ACL, 2025.
- Gema, A.P. et al. “MMLU-Redux: Fixing expert-written evaluation sets.” NAACL, 2025.
- Hendrycks, D. et al. “CUAD: An expert-annotated NLP dataset for legal contract review.” NeurIPS, 2021.
- Guha, N. et al. “LegalBench: A collaboratively built benchmark for measuring legal reasoning.” NeurIPS, 2023.
- OECD. “AI risk management and governance frameworks.” OECD AI Policy Observatory, 2024.
- Zheng, L. et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS, 2023.
- Mitchell, M. et al. “Model Cards for Model Reporting.” FAT*, 2019.
Questions about our methodology?
We're happy to discuss our evaluation approach in detail.