Which AI is best for tax law, law or medicine?

There is no consistently best model — which vendor is the most accurate changes by subject area and by question. The AI-Roundtable Leaderboard measures the leading models every 14 days against real client cases in tax law, medicine, law and business law and shows the currently best one per domain.

Why do other AI rankings deliver completely different results?

Because they measure something different: mostly generic, English-language tasks. The AI-Roundtable Leaderboard tests German-language professional cases against defined target facts, source-blind by three judge families, refreshed every 14 days — model versions change rapidly.

Are AI systems worse than human experts on important questions?

On demanding cases, often yes — the professional remains the final authority. AI is a tool and a second opinion, not a substitute for professional judgement; the ranking helps you choose the most reliable available model.

How can the reliability of AI answers on important questions be increased?

Cross-check several independent models, bind answers to real sources and verify them, keep a human in the loop, pick the best model for the domain, and pseudonymize sensitive data before the model call.

AI-Roundtable Leaderboard

Ranking of the currently most reliable AI systems for tax, medicine and law.

Name: AI-Roundtable Leaderboard
Creator: AI Roundtable
License: https://creativecommons.org/licenses/by/4.0/

Reproducible AI-model evaluation for licensed professionals — refreshed every 14 days. No single model leads everywhere: the per-question winner keeps changing.

Which AI model is the right one this week for my client case, my diagnosis, my contract? The AI-Roundtable Leaderboard delivers the dependable answer: a biweekly published ranking of leading language models on real cases from tax law, medicine, law and business law. Triple-judge scoring, hallucination detection, bootstrap confidence intervals, a fully auditable JSONL history.

Scored by 3 independent AI reviewers.

How to read this recommendation

The 0–100 number shows what percentage of the manually defined target facts a model named and substantiated correctly in its answer — checked against real client and case constellations.

80–100 · very strong — defensible for high-risk client cases
60–79 · solid for everyday professional use
40–59 · usable with judgement, counter-review mandatory
below 40 · not yet recommended
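
The reading guide above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's scoring code: the function names are hypothetical, and it assumes that a score is simply the share of target facts counted as correct and that fractional scores below a threshold fall into the lower band.

```python
def fact_score(hits: int, total: int) -> float:
    """Percentage of target facts named and substantiated correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= hits <= total:
        raise ValueError("hits must lie between 0 and total")
    return 100.0 * hits / total


def band(score: float) -> str:
    """Map a 0-100 score to the reading bands listed above."""
    if score >= 80:
        return "very strong"
    if score >= 60:
        return "solid"
    if score >= 40:
        return "usable with judgement"
    return "not yet recommended"


# Hypothetical example: 7 of 10 target facts correct gives 70.0, band "solid".
print(band(fact_score(7, 10)))
```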

Status & caveats for the current run

Status: Statistically thin. Robust trend statements are possible only from run #4 onward (12 weeks). The methodology is complete, and the raw JSONL can be inspected in the audit trail.

Full coverage: All four models (Opus 4.7, GPT-5, Gemini 2.5 Pro, Mistral Large 2) return an evaluated answer on all 33 items. Earlier coverage gaps in pilot runs #1–#6 were not caused by safety policies, but by a max_output_tokens limit set too low relative to the internal thinking budget of the reasoning models (GPT-5 reasoning_tokens, Gemini 2.5 Pro thinking_tokens). Diagnostic diff documented in the audit trail with finish_reason=MAX_TOKENS.
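
A check of this kind can be sketched as a scan over the run's JSONL for truncated answers. The record layout is an assumption: the field names `model` and `finish_reason` are hypothetical and may differ from the published schema; only the value `MAX_TOKENS` is taken from the text above.

```python
import json
from collections import Counter


def truncated_by_model(jsonl_text: str) -> Counter:
    """Count answers per model that stopped with finish_reason == "MAX_TOKENS".

    Assumes one JSON object per line with hypothetical keys
    "model" and "finish_reason".
    """
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("finish_reason") == "MAX_TOKENS":
            counts[record.get("model", "unknown")] += 1
    return counts
```

A non-zero count for a reasoning model would suggest raising the output-token limit before treating the missing answer as a refusal.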

Winners' podium — cross-section across all domains

Snapshot of the current ranking as a cross-section across all four domains (tax law, medicine, law, business law). Step height is proportional to the top100 score (see methodology). The domain-specific breakdown follows further below; it shows that the order usually differs from one subject area to the next, which is why you should never rely on a single model.

Score history across all runs

Quality score per model across the published runs.

Vendor history

Models consolidated per vendor (moving average, n=3).
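
For orientation, a moving average with n=3 can be computed as below. This is a sketch under one assumption: the window is trailing and simply shorter for the first two runs. The page does not state how the chart handles the start of the series, and the score series in the example is made up.

```python
def moving_average(values, n=3):
    """Trailing moving average; early points average over fewer values."""
    if n <= 0:
        raise ValueError("n must be positive")
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical score series over four runs.
print(moving_average([60, 66, 72, 78]))  # [60.0, 63.0, 66.0, 72.0]
```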

Four domains, 100 real case items each

Every run scores all activated models on a fixed, versioned question bank per domain. Items are tracked in two separate groups: public items (BFH/BGH/guidelines) and synthetic items, which are included to eliminate training effects and tool-search advantages.

Tax law

hidden profit distribution (vGA), tax-group consolidation (Organschaft), § 8c KStG, cross-border aspects — complex client cases

Opus 4.7: 69
GPT-5: 51
Gemini 2.5: 31
Mistral L2: 11
Grok 4.3: 0
Opus 4.8: 0

Medicine

multimorbidity, drug interactions, atypical symptom presentation

GPT-5: 84
Opus 4.7: 82
Mistral L2: 67
Gemini 2.5: 53
Grok 4.3: 0
Opus 4.8: 0

Law

competing claims, standard-terms content review, consequences of a defect of form

Opus 4.7: 79
GPT-5: 71
Gemini 2.5: 55
Mistral L2: 47
Grok 4.3: 0
Opus 4.8: 0

Business law

share-transfer restrictions (Vinkulierung), voting bans, de-facto group liability

GPT-5: 74
Opus 4.7: 69
Gemini 2.5: 46
Mistral L2: 45
Grok 4.3: 0
Opus 4.8: 0

No model leads everywhere — why a single number per model is misleading

A domain percentage is an average across many questions. For each question the best model changes — even within the same domain. The oracle ("best per question": picking the strongest model for every single question) is an upper bound that you cannot hit in advance — and even it stays below 100: a hard remainder that no model solves.
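
The difference between the best fixed model and the oracle can be made concrete with a small sketch. The per-question scores and the model names "A" and "B" are invented for illustration only; the function names are hypothetical as well.

```python
def best_fixed_model(per_question):
    """Model with the highest average when one model is fixed in advance.

    per_question: list of dicts mapping model name -> score on that question.
    """
    models = list(per_question[0].keys())
    averages = {
        m: sum(q[m] for q in per_question) / len(per_question) for m in models
    }
    best = max(averages, key=averages.get)
    return best, averages[best]


def oracle_score(per_question):
    """Average of the best score per question (not achievable in advance)."""
    return sum(max(q.values()) for q in per_question) / len(per_question)


# Invented scores for three questions and two models.
questions = [{"A": 90, "B": 40}, {"A": 30, "B": 80}, {"A": 60, "B": 50}]
print(best_fixed_model(questions))  # ('A', 60.0)
print(round(oracle_score(questions), 1))  # 76.7
```

In this toy data the oracle beats the best fixed model because a different model wins on the second question, and it still stays below 100 because no model solves any question completely.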

AI-Roundtable addresses this model-selection problem. It does not reach the theoretical "best per question" oracle, but it reliably beats every individual model you could fix in advance, because no model is consistently the best across all questions and domains. This removes the user's model-selection risk. Methodology → model selection

Basis: the full published 14-day run across all domains — not the small counter-check sample.

Maximum safety when it really counts.

Anyone relying on AI in a professional setting cannot afford to work with the wrong model — a fabricated citation, a wrong diagnosis, an embarrassing client letter are too expensive. The AI-Roundtable Leaderboard delivers the dependable answer every 14 days: which model is the right one for the next two weeks.

Clear actionable recommendation

One unambiguous model recommendation per domain per run — based on score, hallucination rate and confidence interval. No list to wade through, just a concrete recommendation.

Biweekly rhythm

Model versions change rapidly. The leaderboard keeps you automatically up to date — the recommendation from three months ago is often the wrong one today.

Compliance-conscious presentation

Raw data public, methodology git-versioned, audit trail fully traceable. Demonstrable to regulators, auditors and clients — required reading wherever reliability is demanded.

Straight to your inbox

A snapshot of the most important movements lands in your inbox automatically every 14 days. The subject line shows the top-mover highlight. Cancel monthly, no account, no login.

What sets the leaderboard apart

Triple-judge scoring

Three independent judge models from three different vendor families (Opus 4.7 · GPT-5 · Mistral Large 2) score every answer with the source label blinded. Inter-rater agreement is published as part of the audit data.

Hallucination detection

Every fabricated claim — a wrong case number, an invented statute, a non-existent study — is extracted verbatim and recorded in the audit trail.

Bootstrap confidence intervals

1000 resamples of the cell scores per model × domain. We publish the mean score together with its 95% percentile interval: no point value without an uncertainty estimate.
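
A percentile bootstrap of this kind can be sketched with the standard library alone. The assumptions are that the cell scores of one model × domain cell are resampled with replacement and that the interval bounds are the 2.5th and 97.5th percentiles of the resampled means; the function name, the fixed seed and the example scores are illustrative and not taken from the leaderboard's pipeline.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(scores), lower, upper


# Hypothetical per-item scores for one model in one domain.
mean, lower, upper = bootstrap_ci([0, 50, 100, 50, 75, 25, 100, 0, 50, 50])
print(f"{mean:.1f} [{lower:.1f}, {upper:.1f}]")
```

With few items per cell the interval is wide, which is the reason to read it next to the point score.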

Reproducible at T=0

Question bank git-versioned, model config versioned, deterministic sampling parameters. A second run of the same version produces byte-identical answers.

Append-only history

Once published, run results are never modified or deleted. Methodology changes trigger a new question-bank version; old data points remain visible under their original version.

Raw data public

For each run the complete JSONL — every model answer, every judge rationale — is freely downloadable in the audit trail. No auth, no rate limits.

A perfect complement to the AI-Roundtable app

Anyone running the app on their Mac has several top models debate each other and check one another simultaneously. The leaderboard answers the question that comes before that: which models belong in the roundtable, and which one should take the key role over the next two weeks. The two products work hand in hand.

AI Roundtable — the AI second opinion for decisions that carry weight

Three to four top models check one another, a moderator weighs the arguments. Local on your Mac, designed with professional confidentiality in mind under § 203 StGB (German criminal code), pseudonymized in client mode (Mandatsmodus).

To the main site →

Frequently asked questions

The key contextual questions about the leaderboard. If you want to go deeper, you will find the full procedure in the methodology.

There are other ranking studies that deliver completely different results. How does that fit together?

Because they measure something else. The big public leaderboards mostly evaluate generic tasks — multiple-choice knowledge (MMLU), chat preference by gut feeling (LMArena), coding tasks, almost all in English. The AI-Roundtable Leaderboard measures something very narrow and concrete: German-language professional cases from tax law, medicine, law and business law, against manually defined target facts, checked source-label-blinded by three judge families.

A model that tops a generic English best-of list can perform differently on a § 8c-KStG client case or a differential diagnosis — which is exactly what our domain breakdowns show. Three questions decide whether two rankings are comparable at all: what is being measured, on which task type and language, and at which date (model versions change rapidly — hence our 14-day rhythm). Our answers to these three questions are disclosed and verifiable in the audit trail; that is the real difference.

In addition, manually conducted scientific studies often suffer from a long delay between running the study and publishing the (usually sobering) results, sometimes more than 1.5 years. The studies you may have just read about in the news therefore largely refer to AI models that were state of the art more than 12 months ago, which is a very long time at the current pace of model releases. Above all, they leave users none the wiser about which model is best right now. That is why our leaderboard measures the popular top models every two weeks: only this produces timely transparency and trend curves with practical value.

As things stand, AI systems are still worse than human assessments, e.g. on medical questions. Is that true?

For demanding cases: often yes — and that is exactly how we position the leaderboard. The professional remains, for now, the benchmark and the final authority. Our own figures say the same: even the oracle (picking the best model for every question in advance — not achievable in practice) stays below 100 points across all domains. There is a hard remainder that currently no model solves (why a number per model is misleading).

Here AI is a tool for support and a second opinion, not a substitute for professional judgement. The leaderboard helps you choose the most reliable available tool for the next two weeks — substantive responsibility, the counter-review and the final decision remain with the licensed professional. The ranking values are methodically derived guidance, expressly not legal, tax or medical advice.

In fairness, one further point needs to be acknowledged: it is just as uncertain whether a personal question will reach a human expert who can actually give the correct, well-founded answer, because not everyone is an unrestricted expert in their field. Example: our Roundtable AI model would have passed the Swiss state law examination on the first attempt in every run, with a final grade of 1.x. How many people manage that? In our experience, few, and for now even fewer are clearly better than an AI. How likely it is that your matter lands with a genuine expert is for readers to judge for themselves.

What do I actually do with the results of these rankings in practice?

In practice, in five steps:

Choose a model per domain. For the next task, use the model currently ranked highest for your domain (tax / medicine / law / business law) — not "the best model" in general.
Don't trust a single model. On important questions, have several models check each other; the gap to the oracle shows that every fixed solo model has gaps.
Treat the output as a draft. Every answer is preparatory work; the professional counter-review remains mandatory — especially with a high score, which can feel deceptively safe.
Read the hallucination rate alongside it. It calibrates how skeptically you should check citations, case numbers and figures.
Check again every 14 days. The recommendation from three months ago is often the wrong one today — model versions shift the order.
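
The first two steps can be sketched as a lookup that returns a shortlist per domain instead of a single global winner. The scores are the ones shown in the domain breakdown on this page for the four evaluated models; the function name is hypothetical, and the sketch ranks by score only, whereas the published recommendation also takes hallucination rate and confidence interval into account.

```python
# Domain scores as displayed in the domain breakdown on this page.
SCORES = {
    "tax law": {"Opus 4.7": 69, "GPT-5": 51, "Gemini 2.5": 31, "Mistral L2": 11},
    "medicine": {"GPT-5": 84, "Opus 4.7": 82, "Mistral L2": 67, "Gemini 2.5": 53},
    "law": {"Opus 4.7": 79, "GPT-5": 71, "Gemini 2.5": 55, "Mistral L2": 47},
    "business law": {"GPT-5": 74, "Opus 4.7": 69, "Gemini 2.5": 46, "Mistral L2": 45},
}


def shortlist(domain, k=2, scores=SCORES):
    """Top-k models for a domain: the first to use, the rest to cross-check."""
    ranked = sorted(scores[domain].items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


print(shortlist("tax law"))   # ['Opus 4.7', 'GPT-5']
print(shortlist("medicine"))  # ['GPT-5', 'Opus 4.7']
```

Because the table changes with every run, a lookup like this would need to be refreshed from the current run rather than hard-coded.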

How can the reliability of AI systems on important questions be increased?

There are several effective levers that can be combined:

Have several independent models cross-check each other. The principle behind triple-judge and the AI-Roundtable app: a finding supported by several models from different families is more dependable than the statement of a single model.
Bind answers to sources. Instead of free generation, have the model work with real documents and citations (retrieval, document lookup, web/URL fetch) and counter-check every cited reference.
Human in the loop. Final professional review by the licensed professional remains the most important safety anchor.
Demand sources and verify them. Explicitly ask for the case number, statute, guideline — and check whether they really exist.
Choose the right model for the domain — according to the current ranking, not out of habit.
Pseudonymize sensitive data before a model ever sees it (safeguarded in the app via the client mode, Mandatsmodus).

To what extent can hallucinations be detected or even reduced?

Both are possible — a hallucination cannot be ruled out entirely, but it can be detected and significantly reduced.

Detecting:

Consensus across several models. A fact claimed by only a single model and not backed by the others is suspect.
Source check. Verify whether the cited case number, statute or referenced study really exists — fabricated citations are the most common case.
Machine extraction. Our judges pull every fabricated claim (wrong case no., invented statute, non-existent study) verbatim out of the answer and publish the aggregated hallucination rate per run.
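
The consensus idea from the first point can be sketched as follows. This is a deliberately simple illustration and not the judges' extraction procedure: it assumes the cited references have already been extracted per model, compares them by normalized exact string match, and uses hypothetical model names and a made-up case reference.

```python
from collections import defaultdict


def suspect_claims(claims_by_model, min_support=2):
    """Return claims backed by fewer than min_support distinct models."""
    support = defaultdict(set)
    for model, claims in claims_by_model.items():
        for claim in claims:
            # Normalize case and surrounding whitespace before comparing.
            support[claim.strip().lower()].add(model)
    return sorted(c for c, models in support.items() if len(models) < min_support)


# Hypothetical extracted citations from three models.
claims = {
    "model_a": ["§ 8c KStG", "Case X 1/00"],
    "model_b": ["§ 8c kstg"],
    "model_c": ["§ 8c KStG "],
}
print(suspect_claims(claims))  # ['case x 1/00']
```

A flagged claim is only a candidate for checking: a single model can be right, and several models can repeat the same error, so the source check remains necessary.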

Reducing:

Bind to real sources (retrieval / tool use): the model reasons over presented documents instead of from memory.
Model debate, in which the models pin each other down on unsupported statements — exactly the procedure of the AI-Roundtable app.
Demand and verify citations, and use tightly scoped, well-posed prompts instead of open "tell me about …" questions.

More on hallucination capture in the scoring procedure: Methodology → scoring.

On what empirical basis does this ranking stand?

Before the leaderboard went into operation in April 2026, the scoring pipeline was tested in a multi-month validation study across four professional fields. The current run parameters were only frozen after passing this cross-domain validation.

4 domains scientifically validated

Tax law, medicine (BMJ + multimorbidity cases), law (BGH senates quota-weighted) and business law — each domain with its own question-bank version and a reproducible scoring pipeline.

200+ validation sessions per domain sprint

Phase-E sprints, each with N=25–100 real case items per run, tested for inter-judge stability (ρ ≥ 0.84), cross-domain generalization and memorization-confound robustness (pre-/post-training-cutoff comparison).

3 independent judge families

Anthropic + OpenAI + Mistral as triple-judge; source-label-blinded scoring per answer. More detail under Methodology → triple-judge.

The methodological findings established during the validation phase (e.g. Mistral's tax weakness as a domain-stratification effect, the expert-roles lift in BFH, inter-judge stability after tier-1 prompts) feed into the methodology of the running leaderboard and are traceably documented in the full methodology document.

The complete cross-domain validation with all raw statistics, win/loss/tie tables per sub-stratum, limitations and critique response is publicly available as a 10-file gist:

→ Public Validation Gist · 10 Files · ~80 Sessions Cross-Domain

Contains, among others: methodology, legal study (BGH/BFH), medical study (BMJ + MedExpQA), raw-metrics JSON, production roadmap, limitations & threats to validity, critique response, layman abstract. Straight to the limitations file →

Why don't you make the question bank public?

Protection against training contamination. The synthetic client items (roughly 70–90% of each domain bank) reside exclusively in a private repository and are never published. If the item prompts went public, model vendors could include them in their next training dataset — the ranking would tip from a measurement of genuine generalization ability to a measurement of memorization, and all subsequent runs would be contaminated. For the same reason, in the published raw.jsonl the model answer texts as well as the verbatim-extracted hallucinations are removed — the original case constellations would be reconstructable from them. The aggregated validation results (in the gist above) nonetheless remain fully public — not a single item text in the 10 files, only methodology, sub-stratum statistics and the cross-domain comparison. Complete raw answers are available exclusively on direct NDA request for external auditing.

Audit trail

Every run is published as a complete, immutable dataset.

→ Full run history
