The complete raw data of every run.
Every published run is available here as a complete, unaltered JSONL dataset, including all model responses, all judge rationales, and the hallucination extraction.
Available per run
Each run appears here with the following public artifacts:
- progress.jsonl: one line per (model × item × replicate) cell, with response, token usage, and hallucination marker.
- raw.jsonl: complete judge outputs, with an entail/missing/contradict verdict per target fact plus an extra-false-claims list per judge.
- summary.json: scores aggregated per (model × domain), bootstrap CI, hallucination rate, inter-rater agreement, token balance, and the USD cost of the pipeline.
- MANIFEST.json: run metadata (run ID, timestamp, question-bank version, active models, active judges, code version).
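As a minimal sketch of how these four artifacts could be read once downloaded: the file names come from the list above, but the assumption that they sit together in one local directory, and the helper names iter_jsonl and load_run, are mine and purely illustrative.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def load_run(run_dir: Path) -> dict:
    """Load the four artifacts of one run.

    Assumption: all four files were saved into the same local directory.
    """
    return {
        "progress": list(iter_jsonl(run_dir / "progress.jsonl")),
        "raw": list(iter_jsonl(run_dir / "raw.jsonl")),
        "summary": json.loads((run_dir / "summary.json").read_text(encoding="utf-8")),
        "manifest": json.loads((run_dir / "MANIFEST.json").read_text(encoding="utf-8")),
    }
```

Reading the JSONL files line by line keeps memory use flat for large runs; load_run materializes them into lists only for convenience and may not suit very large datasets.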
Run history
Chronological list of all published runs — newest first. Per entry: date, run ID, model/judge count, item count, and direct links to the raw data (JSONL + summary).
How I use the audit trail
Three use cases the open audit trail is designed for:
- Spot-check verification: I pick any model response, read the judge's rationale, and decide for myself whether the assessment holds.
- Hallucination audit: I filter extra_false_claims by a given model and see every fabricated claim verbatim.
- Methodology review: I pull two runs with the same question_bank_version and check whether the reproducibility guarantee holds.
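The hallucination audit and the first step of the methodology review can be sketched roughly as follows. Only the names extra_false_claims and question_bank_version appear in the text above; the "model" key, the flat record shape, the placement of question_bank_version in MANIFEST.json, and the sample data are assumptions for illustration and may differ from the published files.

```python
import json
from typing import Iterable, Iterator


def false_claims_for_model(records: Iterable[dict], model: str) -> Iterator[str]:
    """Yield every extra false claim recorded for one model.

    Assumed record shape (not confirmed by the source):
    {"model": "...", "extra_false_claims": ["claim", ...]}
    """
    for record in records:
        if record.get("model") != model:
            continue
        # A missing or null field is treated as "no fabricated claims".
        yield from record.get("extra_false_claims") or []


def same_question_bank(manifest_a: dict, manifest_b: dict) -> bool:
    """True only if both manifests carry the same, non-missing version."""
    version_a = manifest_a.get("question_bank_version")
    return version_a is not None and version_a == manifest_b.get("question_bank_version")


# Made-up sample lines standing in for judge output records.
sample_lines = [
    '{"model": "model-a", "extra_false_claims": ["claim one", "claim two"]}',
    '{"model": "model-b", "extra_false_claims": []}',
    '{"model": "model-a", "extra_false_claims": null}',
]
records = [json.loads(line) for line in sample_lines]
claims = list(false_claims_for_model(records, "model-a"))
print(claims)
```

Checking same_question_bank first guards against comparing runs that used different question banks; whether the reproducibility guarantee holds then depends on comparing the scores in the two summary.json files.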