Tax law
Live value from the current run.
Which AI model is the right one this week for my client case, my diagnosis, my contract? The AI-Roundtable Leaderboard delivers a dependable answer: a ranking of leading language models, published every two weeks, on real cases from tax law, medicine, law and business law. It comes with triple-judge scoring, hallucination detection, bootstrap confidence intervals and a fully auditable JSONL history.
The max_output_tokens limit was set too low relative to the internal thinking budget of the reasoning models (GPT-5 reasoning_tokens, Gemini 2.5 Pro thinking_tokens).
The diagnostic diff is documented in the audit trail with finish_reason=MAX_TOKENS.
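A truncation of this kind can be spotted mechanically in the run records. The following is a minimal sketch, assuming each record is a dict with `model` and `finish_reason` keys; these field names and the sample records are hypothetical and not the leaderboard's actual schema.

```python
def truncated_models(records):
    """Return the sorted names of models with at least one answer cut off at the token limit."""
    return sorted(
        {r["model"] for r in records if r.get("finish_reason") == "MAX_TOKENS"}
    )

# Hypothetical records; a real run would load them from the audit trail.
sample_records = [
    {"model": "model_a", "finish_reason": "STOP"},
    {"model": "model_b", "finish_reason": "MAX_TOKENS"},
    {"model": "model_b", "finish_reason": "STOP"},
]
flagged = truncated_models(sample_records)
```

Here only `model_b` is flagged, because one of its answers ended with `MAX_TOKENS`.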
Snapshot of the current ranking as a cross-section across all four domains (tax law, medicine, law, business law). Step height is proportional to the top100 score (see methodology). The domain-specific breakdown follows further below; it shows that the order usually differs from one subject area to the next. That is why you should never trust a single model!
Quality score per model across the published runs.
Models consolidated. (Moving average n=3)
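As an illustration of the smoothing, here is a minimal sketch of a moving average with n=3. It assumes a trailing window (each point averages the current run and up to two preceding runs); the page does not say whether the window is trailing or centered, and the sample scores are hypothetical.

```python
from statistics import mean

def moving_average(scores, n=3):
    """Trailing moving average; early points use as many runs as are available."""
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - n + 1): i + 1]
        smoothed.append(mean(window))
    return smoothed

run_scores = [60.0, 66.0, 63.0, 72.0]  # hypothetical quality scores of one model
smoothed_scores = moving_average(run_scores)
```

The smoothed series damps run-to-run noise, which makes the trend per model easier to read.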
Every run scores all activated models on a fixed, versioned question bank per domain. Items are tracked separately as public items (BFH/BGH/guidelines) and synthetic items (to eliminate training effects and tool-search advantages).
A domain percentage is an average across many questions. For each question the best model changes — even within the same domain. The oracle ("best per question": picking the strongest model for every single question) is an upper bound that you cannot hit in advance — and even it stays below 100: a hard remainder that no model solves.
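The gap between the oracle and the best fixed model can be made concrete with a small sketch. The model names and per-question scores below are hypothetical and chosen only to show the arithmetic.

```python
# Hypothetical per-question scores (0-100) for three models on three questions.
scores = {
    "model_a": [90, 40, 70],
    "model_b": [50, 80, 60],
    "model_c": [60, 60, 85],
}

def average(values):
    return sum(values) / len(values)

def best_fixed_model(scores):
    """The single model with the highest domain average, chosen in advance."""
    name = max(scores, key=lambda m: average(scores[m]))
    return name, average(scores[name])

def oracle_average(scores):
    """Average of the best score per question: an upper bound, not a usable strategy."""
    per_question_best = [max(column) for column in zip(*scores.values())]
    return average(per_question_best)

fixed_name, fixed_avg = best_fixed_model(scores)
oracle_avg = oracle_average(scores)
```

In this toy data a different model wins each question, so the oracle average sits well above the best fixed model's average while still staying below 100.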
The AI-Roundtable solves the model-selection problem. It does not reach the theoretical "best per question" oracle, but it reliably beats every individual model you could fix in advance, because no model is consistently the best across all questions and domains. That removes the user's model-selection risk. Methodology → model selection
Basis: the full published 14-day run across all domains — not the small counter-check sample.
Anyone relying on AI in a professional setting cannot afford to work with the wrong model — a fabricated citation, a wrong diagnosis, an embarrassing client letter are too expensive. The AI-Roundtable Leaderboard delivers the dependable answer every 14 days: which model is the right one for the next two weeks.
One unambiguous model recommendation per domain per run — based on score, hallucination rate and confidence interval. No list to wade through, just a concrete recommendation.
Model versions change rapidly. The leaderboard keeps you automatically up to date — the recommendation from three months ago is often the wrong one today.
Raw data public, methodology git-versioned, audit trail fully traceable. Demonstrable to regulators, auditors and clients — required reading wherever reliability is demanded.
A snapshot of the most important movements lands in your inbox automatically every 14 days. The subject line shows the top-mover highlight. Cancel monthly, no account, no login.
Three independent judge models from three different vendor families (Opus 4.7 · GPT-5 · Mistral Large 2) score every answer source-label-blinded. Inter-rater agreement is part of the public audit value.
Every fabricated claim — a wrong case number, an invented statute, a non-existent study — is extracted verbatim and recorded in the audit trail.
1000 bootstrap resamples of the cell scores per model × domain. We publish the mean score plus a 95% percentile interval: no point value without an uncertainty estimate.
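A percentile bootstrap of this kind can be sketched in a few lines. This is a minimal illustration, not the leaderboard's pipeline: the function name, the fixed seed and the sample cell scores are assumptions.

```python
import random
from statistics import mean

def bootstrap_ci(cell_scores, resamples=1000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap on the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(cell_scores)
    # Resample with replacement, compute the mean of each resample, then sort.
    means = sorted(mean(rng.choices(cell_scores, k=n)) for _ in range(resamples))
    lower_idx = round((1 - level) / 2 * resamples)
    upper_idx = round((1 + level) / 2 * resamples) - 1
    return mean(cell_scores), means[lower_idx], means[upper_idx]

# Hypothetical scores of one model in one domain.
cell_scores = [62.0, 71.0, 55.0, 80.0, 68.0, 74.0, 59.0, 66.0]
point, lower, upper = bootstrap_ci(cell_scores)
```

With 1000 resamples and a 95% level, the interval bounds are the 26th and 975th sorted resample means.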
Question bank git-versioned, model config versioned, deterministic sampling parameters. A second run of the same version produces byte-identical answers.
Once published, run results are never modified or deleted. Methodology changes trigger a new question-bank version; old data points remain visible under their original version.
For each run the complete JSONL — every model answer, every judge rationale — is freely downloadable in the audit trail. No auth, no rate limits.
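Because JSONL stores one JSON object per line, a downloaded run can be processed with the standard library alone. The sketch below groups scores into model × domain cells; the field names `model`, `domain` and `score` and the sample lines are hypothetical, since the actual schema is not described here.

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical sample; a real run would be read from the downloaded file.
sample_jsonl = (
    '{"model": "model_a", "domain": "tax_law", "score": 80}\n'
    '{"model": "model_a", "domain": "tax_law", "score": 70}\n'
    '{"model": "model_b", "domain": "medicine", "score": 65}\n'
)

def cell_means(jsonl_text):
    """Parse JSONL text and return the mean score per (model, domain) cell."""
    cells = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        cells[(record["model"], record["domain"])].append(record["score"])
    return {cell: mean(values) for cell, values in cells.items()}

means = cell_means(sample_jsonl)
```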
Anyone running the app on their Mac has several top models debate each other and check one another simultaneously. The leaderboard answers the question that comes before that: which models belong in the roundtable, and which one should take the key role over the next two weeks. The two products work hand in hand.
Three to four top models check one another, a moderator weighs the arguments. Local on your Mac, designed with professional confidentiality in mind under § 203 StGB (German criminal code), pseudonymized in client mode (Mandatsmodus).
The key contextual questions about the leaderboard. If you want to go deeper, you will find the full procedure in the methodology.
Because they measure something else. The big public leaderboards mostly evaluate generic tasks — multiple-choice knowledge (MMLU), chat preference by gut feeling (LMArena), coding tasks, almost all in English. The AI-Roundtable Leaderboard measures something very narrow and concrete: German-language professional cases from tax law, medicine, law and business law, against manually defined target facts, checked source-label-blinded by three judge families.
A model that tops a generic English best-of list can perform differently on a § 8c-KStG client case or a differential diagnosis — which is exactly what our domain breakdowns show. Three questions decide whether two rankings are comparable at all: what is being measured, on which task type and language, and at which date (model versions change rapidly — hence our 14-day rhythm). Our answers to these three questions are disclosed and verifiable in the audit trail; that is the real difference.
On top of that, manually conducted scientific studies often suffer from a long gap between running the study and publishing the (usually sobering) results, sometimes more than 1.5 years. The study publications you may have just read about in the news therefore largely refer to AI models that were state of the art more than 12 months ago. At the current pace of AI development, that is an eternity. Above all, it leaves users none the wiser about which model is best right now. That is precisely why our leaderboard measures all the popular top models every 2 weeks. Only that way do we create real, timely transparency and trend curves with practical value.
For demanding cases: often yes — and that is exactly how we position the leaderboard. The professional remains, for now, the benchmark and the final authority. Our own figures say the same: even the oracle (picking the best model for every question in advance — not achievable in practice) stays below 100 points across all domains. There is a hard remainder that currently no model solves (why a number per model is misleading).
Here AI is a tool for support and a second opinion, not a substitute for professional judgement. The leaderboard helps you choose the most reliable available tool for the next two weeks — substantive responsibility, the counter-review and the final decision remain with the licensed professional. The ranking values are methodically derived guidance, expressly not legal, tax or medical advice.
That said, honesty cuts both ways: it is equally uncertain whether, with a personal question, you will reach a human expert who can actually give the correct, well-founded answer. Not every professional is an unrestricted expert in their field. Example: our Roundtable AI model would have passed the Swiss state law examination in every run with a final grade of 1.x on the first attempt. How many people manage that? In our experience, few. And even fewer are definitively better than an AI, for now. How likely you are to land with an actual expert for your matter is for you to judge.
In practice, in five steps:
There are several effective levers that can be combined:
Both are possible — a hallucination cannot be ruled out entirely, but it can be detected and significantly reduced.
Detecting:
Reducing:
More on hallucination capture in the scoring procedure: Methodology → scoring.
Before the leaderboard went into operation in April 2026, the scoring pipeline was tested in a multi-month validation study across four professional fields. The current run parameters were only frozen after passing this cross-domain validation.
Tax law, medicine (BMJ + multimorbidity cases), law (BGH senates quota-weighted) and business law — each domain with its own question-bank version and a reproducible scoring pipeline.
Phase-E sprints with N=25–100 real case items per run each, tested for inter-judge stability (ρ ≥ 0.84), cross-domain generalization and memorization-confound robustness (pre-/post-training-cutoff comparison).
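Inter-judge stability of this kind can be checked with a rank correlation between two judges' scores on the same items. The sketch below assumes that ρ denotes Spearman's rank correlation, which the page does not state explicitly; the judge scores are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank vectors (undefined if one input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

judge_1 = [80, 65, 90, 40, 72]  # hypothetical scores from one judge
judge_2 = [78, 70, 88, 45, 60]  # hypothetical scores from a second judge
rho = spearman_rho(judge_1, judge_2)
```

For these five items the two judges swap only one adjacent pair of ranks, which gives ρ = 0.9 and would clear a 0.84 threshold.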
Anthropic + OpenAI + Mistral as triple-judge; source-label-blinded scoring per answer. More detail under Methodology → triple-judge.
The methodological findings established during the validation phase (e.g. Mistral's tax weakness as a domain-stratification effect, the expert-roles lift in BFH, inter-judge stability after tier-1 prompts) feed into the methodology of the running leaderboard and are traceably documented in the full methodology document.
The complete cross-domain validation with all raw statistics, win/loss/tie tables per sub-stratum, limitations and critique response is publicly available as a 10-file gist:
→ Public Validation Gist · 10 Files · ~80 Sessions Cross-Domain
Contains, among others: methodology, legal study (BGH/BFH), medical study (BMJ + MedExpQA), raw-metrics JSON, production roadmap, limitations & threats to validity, critique response, layman abstract. Straight to the limitations file →
In raw.jsonl, the model answer texts and the verbatim-extracted hallucinations are removed, because the original case constellations would be reconstructable from them. The aggregated validation results (in the gist above) nonetheless remain fully public: the 10 files contain not a single item text, only methodology, sub-stratum statistics and the cross-domain comparison. Complete raw answers are available exclusively on direct NDA request for external auditing.
Every run is published as a complete, immutable dataset.