AI-Roundtable Leaderboard

Ranking of the currently most reliable AI systems for tax, medicine and law.

Reproducible AI-model evaluation for licensed professionals — refreshed every 14 days. No single model leads everywhere: the per-question winner keeps changing.

Which AI model is the right one this week for my client case, my diagnosis, my contract? The AI-Roundtable Leaderboard delivers the dependable answer: a biweekly published ranking of leading language models on real cases from tax law, medicine, law and business law. Triple-judge scoring, hallucination detection, bootstrap confidence intervals, a fully auditable JSONL history.

Scored by 3 independent AI reviewers.
CAUTION! The score per model can be misleading. Which model is best changes from question to question, even within the same domain. Even someone who could "clairvoyantly" pick the best model for every question in advance (the "oracle") would still score below 100: a hard remainder that no model has solved so far. Why this matters →
Status: The current run is being loaded from the worker. The data are still statistically thin; robust trend statements are possible from run #4 onward (12 weeks). The methodology is complete, and the raw JSONL can be inspected in the audit trail.
Full coverage: All four models (Opus 4.7, GPT-5, Gemini 2.5 Pro, Mistral Large 2) return an evaluated answer on all 33 items. Earlier coverage gaps in pilot runs #1–#6 were not caused by safety policies, but by a max_output_tokens limit set too low relative to the internal thinking budget of the reasoning models (GPT-5 reasoning_tokens, Gemini 2.5 Pro thinking_tokens). Diagnostic diff documented in the audit trail with finish_reason=MAX_TOKENS.
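A truncation diagnostic of the kind described above could be sketched as follows. This is a minimal illustration, not the project's actual tooling: the record fields `model`, `item_id` and `finish_reason` are assumed names for the JSONL schema, and only the value `MAX_TOKENS` is taken from the text.

```python
import json
from collections import Counter


def truncated_by_model(jsonl_text):
    """Count, per model, the records whose finish_reason is MAX_TOKENS."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "finish_reason" and "model" are hypothetical field names.
        if record.get("finish_reason") == "MAX_TOKENS":
            counts[record["model"]] += 1
    return dict(counts)


# Made-up records for illustration only.
sample = "\n".join([
    json.dumps({"model": "model-a", "item_id": "t-001", "finish_reason": "STOP"}),
    json.dumps({"model": "model-b", "item_id": "t-001", "finish_reason": "MAX_TOKENS"}),
    json.dumps({"model": "model-b", "item_id": "t-002", "finish_reason": "MAX_TOKENS"}),
])
print(truncated_by_model(sample))
```

A model that shows up in this count has hit the output limit rather than refused, which is the distinction the coverage note draws.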

Winners' podium — cross-section across all domains

Snapshot of the current ranking as a cross-section across all four domains (tax law, medicine, law, business law). Step height is proportional to the top100 score (see methodology). The domain-specific breakdown follows further below; it shows that the order usually differs from one subject area to the next. That is why you should never trust a single model!

AI-ROUNDTABLE LEADERBOARD · ranking.ai-roundtable.de · As of: 2026-05-10 · Cross-section across all domains · per-domain details below

Score history across all runs

Quality score per model across the published runs.

Vendor history

Models consolidated. (Moving average n=3)
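The smoothing mentioned here can be illustrated with a short sketch. It assumes a trailing window of three runs and averages over fewer points at the start of the series; how the leaderboard actually treats the first runs is not stated, so that choice is an assumption, and the scores are made up.

```python
def moving_average(values, window=3):
    """Trailing moving average; early points use however many values exist."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up per-run scores for one vendor.
run_scores = [60.0, 66.0, 63.0, 72.0]
print(moving_average(run_scores))  # [60.0, 63.0, 63.0, 67.0]
```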

Four domains, 100 real case items each

Every run scores all activated models on a fixed, versioned question bank per domain. Items are tracked in two separate groups: public items (BFH/BGH/guidelines) and synthetic items (included to eliminate training effects and tool-search advantages).

Tax law

hidden profit distribution (vGA), tax-group consolidation (Organschaft), § 8c KStG, cross-border aspects — complex client cases
Opus 4.7: 69
GPT-5: 51
Gemini 2.5: 31
Mistral L2: 11
Grok 4.3: 0
Opus 4.8: 0

Medicine

multimorbidity, drug interactions, atypical symptom presentation
GPT-5: 84
Opus 4.7: 82
Mistral L2: 67
Gemini 2.5: 53
Grok 4.3: 0
Opus 4.8: 0

Law

competing claims, standard-terms content review, consequences of a defect of form
Opus 4.7: 79
GPT-5: 71
Gemini 2.5: 55
Mistral L2: 47
Grok 4.3: 0
Opus 4.8: 0

Business law

share-transfer restrictions (Vinkulierung), voting bans, de-facto group liability
GPT-5: 74
Opus 4.7: 69
Gemini 2.5: 46
Mistral L2: 45
Grok 4.3: 0
Opus 4.8: 0
✓ Live data · auto-refresh from the currently published run · first bar per domain highlighted in green, last in red · updated with every new 14-day run.

No model leads everywhere — why a single number per model is misleading

A domain percentage is an average across many questions. For each question the best model changes — even within the same domain. The oracle ("best per question": picking the strongest model for every single question) is an upper bound that you cannot hit in advance — and even it stays below 100: a hard remainder that no model solves.
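The relationship between per-model averages and the oracle can be made concrete with a small sketch. The scores are made up and the model names are placeholders; only the definition of the oracle (best model per question) comes from the text.

```python
def fixed_model_means(scores):
    """Mean score per model across all questions."""
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


def oracle_mean(scores):
    """Mean of the per-question maximum across all models."""
    best_per_question = [max(question) for question in zip(*scores.values())]
    return sum(best_per_question) / len(best_per_question)


# Made-up scores: one list entry per question, same question order per model.
scores = {
    "model-a": [90, 40, 70, 60],
    "model-b": [50, 80, 60, 70],
}
print(fixed_model_means(scores))  # {'model-a': 65.0, 'model-b': 65.0}
print(oracle_mean(scores))        # 77.5
```

In this toy data both models average 65, yet each wins two of the four questions. The oracle reaches 77.5 by switching per question and still stays below 100, because no model answers any question perfectly.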

The AI-Roundtable solves the model-selection problem. It does not reach the (theoretical) "best per question" oracle, but it reliably beats every individual model you could fix in advance, because no model is consistently the best across all questions and domains. This removes the user's model-selection risk. Methodology → model selection

Basis: the full published 14-day run across all domains — not the small counter-check sample.

Maximum safety when it really counts.

Anyone relying on AI in a professional setting cannot afford to work with the wrong model — a fabricated citation, a wrong diagnosis, an embarrassing client letter are too expensive. The AI-Roundtable Leaderboard delivers the dependable answer every 14 days: which model is the right one for the next two weeks.

Clear actionable recommendation

One unambiguous model recommendation per domain per run — based on score, hallucination rate and confidence interval. No list to wade through, just a concrete recommendation.

Biweekly rhythm

Model versions change rapidly. The leaderboard keeps you automatically up to date — the recommendation from three months ago is often the wrong one today.

Compliance-conscious presentation

Raw data public, methodology git-versioned, audit trail fully traceable. Demonstrable to regulators, auditors and clients — required reading wherever reliability is demanded.

Straight to your inbox

A snapshot of the most important movements lands in your inbox automatically every 14 days. The subject line shows the top-mover highlight. Cancel monthly, no account, no login.

What sets the leaderboard apart

Triple-judge scoring

Three independent judge models from three different vendor families (Opus 4.7 · GPT-5 · Mistral Large 2) score every answer source-label-blinded. Inter-rater agreement is part of the public audit value.
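Inter-rater agreement between three judges could be computed along the following lines. This is a sketch under assumptions: it uses pairwise Spearman rank correlation, which may or may not be the statistic the leaderboard publishes, and the judge names and scores are made up.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pairwise_agreement(judge_scores):
    """Spearman correlation for every pair of judges."""
    return {
        (a, b): spearman(judge_scores[a], judge_scores[b])
        for a, b in combinations(judge_scores, 2)
    }


# Made-up scores: each judge rates the same five blinded answers.
judge_scores = {
    "judge-1": [80, 60, 90, 40, 70],
    "judge-2": [75, 65, 95, 35, 60],
    "judge-3": [85, 55, 80, 45, 70],
}
for pair, rho in pairwise_agreement(judge_scores).items():
    print(pair, round(rho, 2))
```

Rank correlation only checks whether the judges order the answers similarly; it does not detect one judge scoring systematically higher than another.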

Hallucination detection

Every fabricated claim — a wrong case number, an invented statute, a non-existent study — is extracted verbatim and recorded in the audit trail.

Bootstrap confidence intervals

1000 resamples on the cell scores per model × domain. We publish the mean score together with its 95% percentile interval; no point value appears without an uncertainty estimate.
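A percentile bootstrap of this kind can be sketched in a few lines. The function name, the fixed seed and the sample scores are illustrative assumptions; only the 1000 resamples and the 95% level come from the text.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=1000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    resampled_means = sorted(
        mean(rng.choices(scores, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = resampled_means[round((1 - level) / 2 * n_resamples)]
    upper = resampled_means[round((1 + level) / 2 * n_resamples) - 1]
    return mean(scores), lower, upper


# Made-up cell scores for one model in one domain.
cell_scores = [55, 70, 80, 65, 90, 40, 75, 60, 85, 70]
point, lower, upper = bootstrap_ci(cell_scores)
print(f"mean={point:.1f}, 95% interval=[{lower:.1f}, {upper:.1f}]")
```

With only a handful of items per cell the interval is wide, which is why a score difference of a few points between two models should not be over-read.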

Reproducible at T=0

Question bank git-versioned, model config versioned, deterministic sampling parameters. A second run of the same version produces byte-identical answers.

Append-only history

Once published, run results are never modified or deleted. Methodology changes trigger a new question-bank version; old data points remain visible under their original version.

Raw data public

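For each run the JSONL with scores and judge rationales is freely downloadable in the audit trail. Model answer texts and verbatim hallucination extracts are removed from the published file to protect the question bank (see the FAQ on the question bank). No auth, no rate limits.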

A perfect complement to the AI-Roundtable app

Anyone running the app on their Mac has several top models debate each other and check one another simultaneously. The leaderboard answers the question that comes before that: which models belong in the roundtable, and which one should take the key role over the next two weeks. The two products work hand in hand.

Frequently asked questions

The key contextual questions about the leaderboard — collapsed, click to expand. If you want to go deeper, you will find the full procedure in the methodology.

There are other ranking studies that deliver completely different results. How does that fit together?

Because they measure something else. The big public leaderboards mostly evaluate generic tasks — multiple-choice knowledge (MMLU), chat preference by gut feeling (LMArena), coding tasks, almost all in English. The AI-Roundtable Leaderboard measures something very narrow and concrete: German-language professional cases from tax law, medicine, law and business law, against manually defined target facts, checked source-label-blinded by three judge families.

A model that tops a generic English best-of list can perform differently on a § 8c-KStG client case or a differential diagnosis — which is exactly what our domain breakdowns show. Three questions decide whether two rankings are comparable at all: what is being measured, on which task type and language, and at which date (model versions change rapidly — hence our 14-day rhythm). Our answers to these three questions are disclosed and verifiable in the audit trail; that is the real difference.

On top of that, manually conducted scientific studies often have a long lag between running the study and publishing the (usually sobering) results, sometimes more than 1.5 years. The study publications you may have just read about in the news therefore largely refer to AI models that were state of the art more than 12 months ago. At the current pace of AI development, that is an eternity. Above all, it leaves users none the wiser about which model is best right now. That is precisely why our leaderboard measures all the popular top models every 2 weeks. Only that way do we create real, timely transparency and, above all, trend curves with practical value.

As things stand, AI systems are still worse than human experts, e.g. on medical questions. Is that true?

For demanding cases: often yes — and that is exactly how we position the leaderboard. The professional remains, for now, the benchmark and the final authority. Our own figures say the same: even the oracle (picking the best model for every question in advance — not achievable in practice) stays below 100 points across all domains. There is a hard remainder that currently no model solves (why a number per model is misleading).

Here AI is a tool for support and a second opinion, not a substitute for professional judgement. The leaderboard helps you choose the most reliable available tool for the next two weeks — substantive responsibility, the counter-review and the final decision remain with the licensed professional. The ranking values are methodically derived guidance, expressly not legal, tax or medical advice.

That said, honesty also requires acknowledging a further everyday fact: it is equally uncertain whether you will reach, with a personal question, a human expert who can actually give the correct and well-founded answer you are looking for. Not everyone is in fact an unrestricted expert in their field. Example: our Roundtable AI model would have passed the Swiss state law examination in every run with a final grade of 1.x on the first attempt. How many people manage that? In our experience, few. And even fewer are definitively better than an AI, for now. How likely it is that your matter will land with an actual expert is something readers can best judge for themselves.

What do I actually do with the results of these rankings in practice?

In practice, in five steps:

  • Choose a model per domain. For the next task, use the model currently ranked highest for your domain (tax / medicine / law / business law) — not "the best model" in general.
  • Don't trust a single model. On important questions, have several models check each other; the gap to the oracle shows that every fixed solo model has gaps.
  • Treat the output as a draft. Every answer is preparatory work; the professional counter-review remains mandatory — especially with a high score, which can feel deceptively safe.
  • Read the hallucination rate alongside it. It calibrates how skeptically you should check citations, case numbers and figures.
  • Check again every 14 days. The recommendation from three months ago is often the wrong one today — model versions shift the order.

How can the reliability of AI systems on important questions be increased?

There are several effective levers that can be combined:

  • Have several independent models cross-check each other. The principle behind triple-judge and the AI-Roundtable app: a finding supported by several models from different families is more dependable than the statement of a single model.
  • Bind answers to sources. Instead of free generation, have the model work with real documents and citations (retrieval, document lookup, web/URL fetch) and counter-check every cited reference.
  • Human in the loop. Final professional review by the licensed professional remains the most important safety anchor.
  • Demand sources and verify them. Explicitly ask for the case number, statute, guideline — and check whether they really exist.
  • Choose the right model for the domain — according to the current ranking, not out of habit.
  • Pseudonymize sensitive data before a model ever sees it (safeguarded in the app via the client mode, Mandatsmodus).

To what extent can hallucinations, for example, be detected or even reduced?

Both are possible — a hallucination cannot be ruled out entirely, but it can be detected and significantly reduced.

Detecting:

  • Consensus across several models. A fact claimed by only a single model and not backed by the others is suspect.
  • Source check. Verify whether the cited case number, statute or referenced study really exists — fabricated citations are the most common case.
  • Machine extraction. Our judges pull every fabricated claim (wrong case no., invented statute, non-existent study) verbatim out of the answer and publish the aggregated hallucination rate per run.
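The consensus check from the list above can be sketched as a simple support count. The function name, the `min_support` threshold and the placeholder reference IDs are assumptions for illustration; real citations would first have to be extracted and normalized from the answer texts.

```python
from collections import Counter


def suspect_citations(citations_by_model, min_support=2):
    """Return references cited by fewer than min_support models."""
    support = Counter()
    for cited in citations_by_model.values():
        support.update(set(cited))  # count each model at most once per reference
    return sorted(ref for ref, n in support.items() if n < min_support)


# Placeholder reference IDs, not real case numbers or statutes.
citations_by_model = {
    "model-a": ["REF-1", "REF-2"],
    "model-b": ["REF-1"],
    "model-c": ["REF-1", "REF-3", "REF-3"],
}
print(suspect_citations(citations_by_model))  # ['REF-2', 'REF-3']
```

A flagged reference is only a candidate for closer inspection: a correct citation can be found by a single model, and several models can share the same fabricated one, so the source check remains necessary.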

Reducing:

  • Bind to real sources (retrieval / tool use): the model reasons over presented documents instead of from memory.
  • Model debate, in which the models pin each other down on unsupported statements — exactly the procedure of the AI-Roundtable app.
  • Demand and verify citations, and use tightly scoped, well-posed prompts instead of open "tell me about …" questions.

More on hallucination capture in the scoring procedure: Methodology → scoring.

On what empirical basis does this ranking stand?

Before the leaderboard went into operation in April 2026, the scoring pipeline was tested in a multi-month validation study across four professional fields. The current run parameters were only frozen after passing this cross-domain validation.

4

domains scientifically validated

Tax law, medicine (BMJ + multimorbidity cases), law (BGH senates quota-weighted) and business law — each domain with its own question-bank version and a reproducible scoring pipeline.

200+

validation sessions per domain sprint

Phase-E sprints with N=25–100 real case items per run each, tested for inter-judge stability (ρ ≥ 0.84), cross-domain generalization and memorization-confound robustness (pre-/post-training-cutoff comparison).

3

independent judge families

Anthropic + OpenAI + Mistral as triple-judge; source-label-blinded scoring per answer. More detail under Methodology → triple-judge.

The methodological findings established during the validation phase (e.g. Mistral's tax weakness as a domain-stratification effect, the expert-roles lift in BFH, inter-judge stability after tier-1 prompts) feed into the methodology of the running leaderboard and are traceably documented in the full methodology document.

The complete cross-domain validation with all raw statistics, win/loss/tie tables per sub-stratum, limitations and critique response is publicly available as a 10-file gist:

→ Public Validation Gist · 10 Files · ~80 Sessions Cross-Domain

Contains, among others: methodology, legal study (BGH/BFH), medical study (BMJ + MedExpQA), raw-metrics JSON, production roadmap, limitations & threats to validity, critique response, layman abstract. Straight to the limitations file →

Why don't you make the question bank public?

Protection against training contamination. The synthetic client items (roughly 70–90% of each domain bank) reside exclusively in a private repository and are never published. If the item prompts went public, model vendors could include them in their next training dataset: the ranking would tip from a measurement of genuine generalization ability to a measurement of memorization, and all subsequent runs would be contaminated. For the same reason, the model answer texts and the verbatim-extracted hallucinations are removed from the published raw.jsonl, because the original case constellations could be reconstructed from them. The aggregated validation results (in the gist above) nonetheless remain fully public: the 10 files contain not a single item text, only methodology, sub-stratum statistics and the cross-domain comparison. Complete raw answers are available exclusively on direct NDA request for external auditing.

Recommend it now

Do you know someone who works with AI in a professional setting? The leaderboard is intended as an industry standard for reproducible model evaluation — spreading the word is part of the impact.

Audit trail

Every run is published as a complete, immutable dataset.

→ Full run history