Methodology

How the leaderboard comes about.

This page describes the evaluation procedure in full. The standalone pipeline produces a reproducible JSONL dataset for each run. That dataset is publicly accessible in the audit trail and takes the place of any claim that is not backed by data.

How to read the leaderboard, the most important rule: every score carries a 95% bootstrap confidence interval. If the CI bands of two models overlap (typically ±2 points at the current sample size), the models are shown as statistically tied: same rank position, same bar height, shared caption. An ordering is asserted only once the CI bands no longer overlap. Details below under "Bootstrap confidence intervals & rank ties".

Bootstrap confidence intervals & rank ties

For each (model × domain) cell, 1000 bootstrap resamples are drawn from the cell scores, and the 95% percentile interval is published. No point value appears without an uncertainty figure: a score difference that lies within the CI bands is not statistically distinguishable and is therefore not asserted as a ranking difference.
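A minimal sketch of a percentile bootstrap for one cell, assuming the cell score is the mean of per-item scores. The function name bootstrap_ci, the fixed seed, and the simple index-based percentile are illustrative assumptions, not the pipeline's actual code.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, level=0.95, seed=0):
    """Return (mean, ci_low, ci_high) for one (model x domain) cell.

    Percentile bootstrap: resample the item scores with replacement,
    take the mean of each resample, and read the interval off the
    sorted resample means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(scores), low, high


# Hypothetical cell with 33 item scores (1 = correct, 0 = incorrect).
example_scores = [1] * 20 + [0] * 13
mean, ci_low, ci_high = bootstrap_ci(example_scores)
```

The interval narrows as the number of items per cell grows, which is why ties at a small sample size can resolve in a later, larger run.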

A concrete example from Run #7: at N = 33 items, Mistral Large 2 (top100 = 45) and Gemini 2.5 Pro (top100 = 44) differ by less than their respective 95% bootstrap CIs of roughly ±2 points. Both models are therefore shown on the ranking page at a shared rank 3 with equally tall bars, not as rank 3 and rank 4. At larger N (Pilot #2 targets 100 items) the CI bands narrow and finer differences become distinguishable. The current tie is therefore explicitly a property of the sample, not a statement about the models.
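The shared rank 3 in this example can be reproduced with a small sketch. It assumes that a tie is declared whenever a model's CI overlaps that of the model directly above it, and that ranks after a tie skip ahead (1, 2, 3, 3, 5); neither detail is specified on this page. The function name assign_ranks and the two placeholder models are hypothetical. Only the Mistral Large 2 and Gemini 2.5 Pro scores come from the text, with ±2 applied as the CI half-width.

```python
def assign_ranks(results):
    """Map each model name to a rank, sharing ranks when CIs overlap.

    results: list of (name, score, ci_low, ci_high) tuples.
    """
    ordered = sorted(results, key=lambda row: row[1], reverse=True)
    ranks = {}
    rank = 0
    previous = None
    for position, row in enumerate(ordered, start=1):
        name, _score, _low, high = row
        # Start a new rank only if this CI lies entirely below the
        # CI of the model directly above it (assumed tie rule).
        if previous is None or high < previous[2]:
            rank = position
        ranks[name] = rank
        previous = row
    return ranks


results = [
    ("placeholder-model-a", 60, 58, 62),  # hypothetical
    ("placeholder-model-b", 52, 50, 54),  # hypothetical
    ("Mistral Large 2", 45, 43, 47),      # score from Run #7, CI approx. ±2
    ("Gemini 2.5 Pro", 44, 42, 46),       # score from Run #7, CI approx. ±2
]
ranks = assign_ranks(results)
```

Chaining ties through the neighboring model is one possible choice. A stricter rule would require every pair inside a tied group to overlap.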

Consequence for the reader: when the ranking chart shows two models as tied, either of the two is an equally well-substantiated choice. Anyone who needs a hard tie-break rule will find the hallucination rate and the inter-rater agreement per cell in the audit trail. Neither tie-breaker rests on a spurious precision of the mean.

The following sections document the procedure in detail. Each item is collapsed — click to expand, or follow a direct anchor link (e.g. /methodik#triple-judge). Anchor links open the respective section automatically.

How is the question bank structured per domain?

Each of the four domains has a versioned question bank in the repository. The bank separates two kinds:

Public items: derived directly from accessible sources (BFH rulings, BGH decisions, medical guidelines, standard legal literature). These measure how well a model masters established content.
Synthetic items: created by a proposer model, critiqued by a reviewer model, and approved by the human owner. These are realistic mandate cases with interwoven factors, not textbook exercises. They measure how well a model handles unfamiliar constellations, and they eliminate the training effect as well as the tool-search advantage.

Question-bank versioning via Git: a second run against the same question_bank_version produces byte-identical answers. Methodology changes trigger a new version; old data points remain visible under their original version.

Synthetic items stay private, deliberately. The item prompts (case constellations, target facts, scoring rubrics) reside exclusively in a private repository and are never published. If the question catalog became public, model vendors could include the items in their next training dataset, and the ranking would tip from genuine generalization ability to memorization. The model answer texts and the verbatim extracted hallucinations are also removed from the published raw.jsonl, because the original items could be reconstructed from them. The aggregated cross-domain validation results, by contrast, are fully open in the public validation gist (10 files: Methodology, Legal Study, Medical Study, Limitations, Critique Response, …) and contain no item text. Raw answers are available only on direct NDA request.

Which models are evaluated — and what is the tool configuration?

All enabled flagship models from the four providers are evaluated, typically Anthropic, OpenAI, Google, and Mistral. Providers plug in via models.json: a new provider is activated by a config entry, with no code changes needed.

Per cell, the model is tested twice: once solo (no tool access) and once with the tool registry enabled (web_search, doc_retrieval, pubmed_search, arxiv_search, url_fetch). Both series are published separately, because the model ordering can shift noticeably between the two modes.
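The following sketch shows how a config-driven run plan could look. The schema of models.json is not documented on this page, so the fields provider, model, and enabled, the placeholder model names, and the function names are all assumptions. The tool names and the solo mode are taken from the text above.

```python
import json

# Hypothetical models.json content; the real schema may differ.
MODELS_JSON = """
[
  {"provider": "anthropic", "model": "example-flagship-a", "enabled": true},
  {"provider": "mistral", "model": "example-flagship-b", "enabled": true},
  {"provider": "newvendor", "model": "example-flagship-c", "enabled": false}
]
"""

# Tool registry as listed in the text.
TOOLS = ["web_search", "doc_retrieval", "pubmed_search", "arxiv_search", "url_fetch"]


def enabled_models(raw_json):
    """Return (provider, model) pairs for entries that are switched on."""
    entries = json.loads(raw_json)
    return [(e["provider"], e["model"]) for e in entries if e.get("enabled", False)]


def run_plan(raw_json):
    """Expand every enabled model into a solo run and a tool-enabled run."""
    plan = []
    for provider, model in enabled_models(raw_json):
        plan.append({"provider": provider, "model": model, "mode": "solo", "tools": []})
        plan.append({"provider": provider, "model": model, "mode": "tools", "tools": list(TOOLS)})
    return plan


plan = run_plan(MODELS_JSON)
```

Under this assumed schema, activating a new provider amounts to setting enabled to true on its entry; the planning code itself stays unchanged.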

More importantly, the best model changes from question to question, even within the same domain. For why a single ranking percentage is therefore misleading, and what the "oracle" (best-per-question) means, see Ranking → No model leads everywhere.
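To illustrate the idea, here is a small sketch that assumes "oracle" means picking the best-scoring model separately for each question and averaging those best scores. That reading, the function name oracle_score, and the example scores are assumptions for illustration only.

```python
def oracle_score(per_question):
    """Average of the best per-question score across all models.

    per_question: {model_name: [score for question 1, question 2, ...]},
    with every list covering the same questions in the same order.
    """
    columns = list(zip(*per_question.values()))
    best = [max(column) for column in columns]
    return sum(best) / len(best)


# Hypothetical scores for two models on four questions.
scores = {
    "model_a": [1, 0, 1, 0],
    "model_b": [0, 1, 1, 0],
}
model_means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
oracle = oracle_score(scores)
```

In this made-up example both models average 0.5, yet the oracle reaches 0.75, because each model wins a different question.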

How is scoring done — closed items, open items, hallucination detection?

Items have two answer types:

Closed items expect a concrete answer (number, multiple choice, exact value). Scored via regex or range match: 1 or 0 per item.
Open items expect a fully worded justification. Scored by a dedicated open-rubric judge that decides, for every recorded target fact: entail (contained in the model output), missing (absent from the model output) or contradict (contradicted by the model output).
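The two answer types can be sketched as follows. The helper names, the rule of taking the first number found in an answer for range matching, and the example patterns are assumptions. How the three verdict labels are turned into an open-item score is not specified on this page, so the sketch only tallies them.

```python
import re
from collections import Counter


def score_regex(answer, pattern):
    """Closed item: 1 if the expected pattern occurs in the answer, else 0."""
    return 1 if re.search(pattern, answer) else 0


def score_range(answer, low, high):
    """Closed item: 1 if the first number in the answer lies in [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    if match is None:
        return 0
    return 1 if low <= float(match.group()) <= high else 0


def tally_verdicts(verdicts):
    """Open item: count judge verdicts per label for the target facts."""
    counts = Counter(verdicts)
    return {label: counts.get(label, 0) for label in ("entail", "missing", "contradict")}


closed_a = score_regex("The correct option is (B).", r"\(B\)")
closed_b = score_range("Roughly 19.5 percent applies.", 19, 20)
open_tally = tally_verdicts(["entail", "entail", "missing", "contradict"])
```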

In addition, the judge records extra false claims in every answer: freely invented assertions beyond the target rubric. Invented case numbers, wrong paragraph numbers, fabricated studies, and wrong figures are extracted verbatim, archived internally, and published per run in aggregated form as the hallucination rate.

What is Triple-Judge — and why three model families?

Every open-item answer is scored by three independent judge models from three different vendor families: Claude Opus 4.7 (Anthropic) · GPT-5 (OpenAI) · Mistral Large 2 (Mistral). The answers reach the judges with the source label blinded: no judge knows which model produced the answer being scored.

The final score is the mean of the three judge scores. The inter-rater agreement (agreement rate across the three judges) is published per cell as an additional audit metric: low agreement signals a constellation in which the scoring itself carries uncertainty.
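A minimal sketch of the aggregation step. The mean of the three judge scores follows the text. The exact definition of the agreement rate is not given here, so the sketch assumes it is the share of target facts on which all three judges return the same verdict. The judge keys and function names are placeholders.

```python
from statistics import fmean


def aggregate_score(judge_scores):
    """Mean of the per-judge scores for one answer."""
    return fmean(judge_scores.values())


def agreement_rate(verdicts_by_judge):
    """Share of target facts with a unanimous verdict (assumed definition).

    verdicts_by_judge: {judge: [verdict for fact 1, fact 2, ...]},
    with every list covering the same facts in the same order.
    """
    per_fact = list(zip(*verdicts_by_judge.values()))
    unanimous = sum(1 for labels in per_fact if len(set(labels)) == 1)
    return unanimous / len(per_fact)


# Hypothetical judge outputs for one answer with four target facts.
final_score = aggregate_score({"judge_1": 0.5, "judge_2": 0.75, "judge_3": 1.0})
agreement = agreement_rate({
    "judge_1": ["entail", "entail", "missing", "contradict"],
    "judge_2": ["entail", "entail", "missing", "missing"],
    "judge_3": ["entail", "entail", "missing", "contradict"],
})
```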

Rationale for the three-family selection: if two of the three judges were from the same model family, the scoring of the third family's model could be systematically biased. Using Anthropic, OpenAI, and Mistral as three independent providers avoids this single-family bias.

How is reproducibility ensured?

Sampling parameters are deterministic: temperature=0 (except for reasoning models that do not accept it). The question bank, the model config (models.json), and the tool config are all git-versioned. A second run against the same version and the same models must produce the same scores within the documented tolerance; deviations are recorded as an audit finding.
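The tolerance check between two runs could look like the sketch below. The function name, the (model, domain) cell keys, the placeholder model names, and the tolerance value of 1.0 are assumptions; the documented tolerance itself is not stated on this page.

```python
def audit_findings(run_a, run_b, tolerance):
    """List cells whose scores differ by more than the tolerance.

    run_a, run_b: {(model, domain): score}. A cell that is present in
    only one of the two runs is also reported.
    """
    findings = []
    for cell in sorted(set(run_a) | set(run_b)):
        a = run_a.get(cell)
        b = run_b.get(cell)
        if a is None or b is None or abs(a - b) > tolerance:
            findings.append((cell, a, b))
    return findings


# Hypothetical scores from two runs against the same versions.
first_run = {("model_x", "legal"): 45.0, ("model_y", "legal"): 44.0}
second_run = {("model_x", "legal"): 45.5, ("model_y", "legal"): 41.0}
findings = audit_findings(first_run, second_run, tolerance=1.0)
```

An empty result means the rerun stayed within tolerance for every cell.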

How does the append-only history work?

Once published, run results are never modified or deleted, not even after later methodology changes. Anyone who wants to audit the June 2026 run three years from now will find the identical JSONL at the same URL as today.

Methodology changes (new question-bank version, altered judge-model mix, new tool slot) generate a new question_bank_version. Data points in the charts are annotated with their version; a methodology change is visible in the trend diagram as its own marker.

What deliberately does NOT appear in the leaderboard — and why?

The following are deliberately omitted because they would distort the evaluation picture:

No latency/cost balance — costs and response times vary by the consumer's plan and are not a model property.
No rankings without a confidence figure — if two models lie within the CI bands, they are shown as tied.
No marketing models — advertised model variants without API access are not included.