Methodology

The HAAS Score: 6 dimensions, transparent methodology

Every AI system we evaluate receives a Helvetic AI Assurance Score (HAAS) across 6 dimensions. Each dimension is scored 0–100 with confidence intervals. Our methodology is fully documented and every result is independently verifiable.

Scoring Framework

6 evaluation dimensions

D1: Performance (25%)

How accurate is the model on real Swiss tasks? Task completion, factual correctness, hallucination detection. Domain-specific scenarios from Swiss-Bench test real-world performance, not generic benchmarks.

D2: Robustness (20%)

Does the model hold up under pressure? Adversarial inputs, prompt injection resistance, stress testing. How does the model perform under edge cases and adversarial conditions?

D3: Safety (15%)

Can you trust what the model says? Hallucination detection, fabricated citation identification, harmful output avoidance. Tests whether models invent Swiss legal references or produce misleading regulatory guidance.

D4: Compliance (20%)

Does the model meet EU AI Act requirements? Technical compliance across applicable articles and technical requirements. Automated scoring built on peer-reviewed methodology from ETH Zurich.

D5: Swiss Language (10%)

Does the model handle DE/FR/IT correctly? Multilingual competence across German, French, and Italian. Language-specific accuracy and Swiss translation quality.

D6: Documentation (10%)

Is the model transparent and well-documented? Model card completeness and explanation quality following the Model Cards framework (Mitchell et al., 2019), evaluated using a structured checklist aligned with EU AI Act Art. 11 (completeness 60%, quality 40%).

System Stack

Built on world-class evaluation science

UK AI Security Institute (evaluation framework)

Our evaluation infrastructure builds on the framework developed by the UK AI Security Institute and adopted by leading AI labs including Anthropic, Google DeepMind, and xAI.* Provides reproducible model evaluations at scale with over 100 built-in evaluation tasks.

ETH Zurich / INSAIT (compliance framework)

EU AI Act compliance scoring built on peer-reviewed methodology from ETH Zurich, mapping regulatory principles to technical requirements. Recognised by the OECD.

Swiss-Bench (Proprietary)

395 proprietary Swiss-specific evaluation scenarios in German, French, and Italian. Tests domain knowledge, multilingual competence, and regulatory understanding across Swiss legal, financial, and administrative contexts.

* Adoption reported by Hamel Husain, Inspect AI: An OSS Python Library For LLM Evals.

Model Selection

Systematic, transparent, reproducible

Every model in our evaluation roster is selected through a documented, four-criteria methodology:

Frontier Performance

Top-tier scores on independent, widely recognised benchmarks. We evaluate models that compete at the frontier, not legacy systems.

Swiss Market Prevalence

Models adopted or considered by Swiss financial institutions, insurers, and corporates. We evaluate what your organisation is likely to deploy.

Cost Feasibility

Per-token pricing compatible with statistically meaningful evaluation. We run rigorous sample sizes, not toy demos.

Ecosystem Coverage

Balanced representation across major providers, open-source and closed-source, US, European, and Chinese models.

Current roster: leading frontier models from major providers, evaluated quarterly. Model selection rationale and exclusion reasons are documented and available upon request. See our Swiss-Bench leaderboard for current results.

Scientific Foundation

Peer-reviewed methodology

Our evaluation system combines three methodological layers, each grounded in peer-reviewed research:

Swiss-Bench (D1 Performance): Our proprietary benchmark uses a three-phase ground truth construction pipeline inspired by the data curation methodology of OpenScholar (Asai et al., Nature, 2026): expert drafting from primary statutory sources, adversarial verification, and quality-gated filtering. A 100-item subset was independently validated by a Swiss legal expert (MLaw, University of Fribourg), achieving 100% Legal Accuracy and 0% rated Incorrect. Model responses are scored by a blind three-judge LLM panel across three dimensions (Legal Accuracy, Citation Accuracy, Completeness) with majority-vote aggregation, following the multi-judge ensemble methodology of Zheng et al. (NeurIPS, 2023). The complete Swiss-Bench methodology is documented in our published ArXiv paper (Uenal, 2026).

Compl-AI (D2 Robustness, D4 Compliance): EU AI Act compliance scoring adapts the COMPL-AI framework (ETH Zurich / INSAIT, 2024), mapping regulatory principles to technical requirements. Recognized by the OECD.

Inspect AI (D3 Safety): Safety and adversarial testing builds on the UK AI Security Institute evaluation framework, adopted by leading AI labs.

Documentation (D6): Transparency assessment follows the Model Cards framework (Mitchell et al., 2019), operationalized as a structured checklist aligned with EU AI Act Article 11 technical documentation requirements.

Our holistic evaluation philosophy follows HELM (Stanford CRFM, peer-reviewed in TMLR). Swiss legal translation evaluation (D5) builds on methodology validated by Niklaus et al. (EMNLP 2023, ACL 2025) covering 180,000+ Swiss legal translation pairs. Related benchmarking work includes MMLU-Redux (Gema et al., NAACL 2025), CUAD (Hendrycks et al., NeurIPS 2021), and LegalBench (Guha et al., NeurIPS 2023). In total, our methodology draws on 100+ peer-reviewed publications.

Why citation accuracy matters (Asai et al., Nature, 2026): When LLMs cite legal articles, regulations, or case law, they fabricate references 78–90% of the time. Our Swiss-Bench scoring framework evaluates citation correctness as a dedicated dimension, not just whether the answer sounds plausible.

Reproducibility

Transparent and verifiable

Every evaluation follows a documented methodology with deterministic scoring. While LLM outputs have inherent variability, our structured scoring framework (temperature 0, fixed prompts, multi-judge majority vote) maximises consistency. You receive detailed benchmark results, scoring breakdowns, and full methodology documentation, sufficient to verify and understand every finding.

This is not an opinion. It’s evidence.

Independence

No conflicts of interest

Helvetic AI has no commercial relationships with any AI model provider. No referral fees, no vendor partnerships, no pay-for-score agreements. Every model is evaluated with the same system, the same benchmarks, and the same scoring methodology.

References

Key publications

Learn More

Questions about our methodology?

We're happy to discuss our evaluation approach in detail.

contact@ai-helvetic.ch