Audit trail

The complete raw data of every run.

Every published run is available here as a complete, unaltered JSONL dataset, including all model responses, all judge rationales, and the hallucination extraction.

Available per run

Each run appears here with the following public artifacts:

progress.jsonl — one line per (model × item × replicate) cell, with response, token usage, and hallucination marker.
raw.jsonl — complete judge outputs: entail/missing/contradict verdict per target fact, plus an extra-false-claims list per judge.
summary.json — scores aggregated per (model × domain), bootstrap CI, hallucination rate, inter-rater agreement, token balance, and the USD cost of the pipeline.
MANIFEST.json — run metadata: run ID, timestamp, question-bank version, active models, active judges, code version.
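
As a minimal sketch of how the two JSONL artifacts could be read: the helper below parses a file line by line and yields one record per line. The function name read_jsonl is hypothetical, and nothing here assumes particular field names inside the records; it relies only on the files being valid JSONL, as described above.

```python
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a damaged file is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

Because the helper is a generator, a large progress.jsonl can be scanned without loading the whole file into memory; wrap the call in list(...) when random access is needed.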

Run history

All published runs, newest first. Each entry shows the date, run ID, model and judge counts, item count, and direct links to the raw data (JSONL files and summary).

How I use the audit trail

Three use cases the open audit trail is designed for:

Spot-check verification: I pick any model response, read the judge's rationale, and decide for myself whether the assessment holds.
Hallucination audit: I filter extra_false_claims by a given model and see every fabricated claim verbatim.
Methodology review: I pull two runs with the same question_bank_version and check whether the reproducibility guarantee holds.
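
The hallucination audit and the first step of the methodology review could be sketched roughly as follows. Only extra_false_claims and question_bank_version are names taken from this page; the "model" key, the flat record layout, and both function names are assumptions made for illustration, so adjust them to the actual structure of raw.jsonl and MANIFEST.json.

```python
from collections import defaultdict


def false_claims_by_model(records, model_key="model", claims_key="extra_false_claims"):
    """Group every extra false claim by the model that produced it.

    `model_key` is a hypothetical field name; `extra_false_claims` is the
    name used on this page, but its exact nesting is assumed here.
    """
    grouped = defaultdict(list)
    for record in records:
        claims = record.get(claims_key) or []  # missing or null means no claims
        if claims:
            grouped[record.get(model_key, "<unknown>")].extend(claims)
    return dict(grouped)


def same_question_bank(manifest_a, manifest_b, key="question_bank_version"):
    """Return True only if both manifests carry the same, non-missing version."""
    version_a = manifest_a.get(key)
    return version_a is not None and version_a == manifest_b.get(key)
```

Checking the version first keeps the comparison honest: two runs are only candidates for a reproducibility check when they were scored against the same question bank. Comparing their scores is a separate step that depends on the layout of summary.json.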