M6 · leak-proof public mirror
Leaderboard
Bench-anchored evidence plus live cockpit runs against the resident brain —
ranked on quality, tok/s, and (when ≥5 prefs land) human preference. Built
from the M6 mirror exporter's allowlist slice of ~/.fieldkit/arena.db.
(lane, bench) fieldkit arena mirror
Every quant variant the Spark has measured, plotted as quality × throughput. The gold line is the Pareto frontier: the builds that no other build beats on both axes at once. Public cloud arenas cannot draw this frontier because they do not know what hardware their votes ran on. We do: the operator is the hardware.
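As a rough illustration of the frontier described above, the sketch below picks out the builds that no other build beats on both quality and throughput. It is not the mirror's actual plotting code: the `Build` type, the `pareto_frontier` helper, and the sample values are all hypothetical, and both axes are assumed to be "higher is better".

```python
from typing import NamedTuple


class Build(NamedTuple):
    """Hypothetical record for one measured quant variant."""
    name: str
    quality: float    # assumed: higher is better
    tok_per_s: float  # assumed: higher is better


def pareto_frontier(builds):
    """Return the non-dominated builds, fastest first.

    Sweep from highest throughput to lowest and keep a build only if its
    quality is strictly higher than every faster (or equally fast) build
    seen so far. Exact duplicates collapse to a single entry.
    """
    ordered = sorted(builds, key=lambda b: (-b.tok_per_s, -b.quality))
    frontier = []
    best_quality = float("-inf")
    for build in ordered:
        if build.quality > best_quality:
            frontier.append(build)
            best_quality = build.quality
    return frontier


# Made-up numbers, purely to exercise the function.
sample = [
    Build("a", 0.90, 20.0),
    Build("b", 0.80, 50.0),
    Build("c", 0.70, 40.0),  # dominated by "b": slower and lower quality
    Build("d", 0.60, 80.0),
]
print([b.name for b in pareto_frontier(sample)])
```

Sorting once and sweeping keeps the selection at O(n log n), which is more than enough for a handful of variants per model.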
Quality index is normalized per model (perplexity is corpus-dependent — only comparable within one base model). Each model's variants form its own curve; hover any point for the raw numbers. Per-model detail lives under Models.
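The page does not state the normalization formula, so the following is only one plausible sketch of a per-model quality index. It assumes the index is the best (lowest) perplexity within a base model divided by each variant's perplexity, so the best variant of every model scores 1.0 and scores are never compared across models. The `quality_index` function and the sample rows are hypothetical.

```python
def quality_index(rows):
    """Map (model, variant) to a quality index in (0, 1].

    rows: iterable of (model, variant, perplexity) tuples.
    Assumed formula: best perplexity in the model / variant perplexity.
    """
    rows = list(rows)  # iterated twice below
    best = {}
    for model, _variant, ppl in rows:
        if ppl <= 0:
            raise ValueError("perplexity must be positive")
        best[model] = min(best.get(model, ppl), ppl)
    return {(model, variant): best[model] / ppl for model, variant, ppl in rows}


# Made-up perplexities for two unrelated base models.
index = quality_index([
    ("model-a", "q4", 10.0),
    ("model-a", "q8", 8.0),
    ("model-b", "q4", 30.0),
])
print(index)
```

Because each model is scaled against its own best variant, a perplexity of 30.0 on one corpus and 8.0 on another can both map to 1.0, which is the point of normalizing per model.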
⬢ Bench-anchored — cached evidence
| Rank | Lane | Quality | Throughput | Runs |
|---|---|---|---|---|
| 1 | frontier-only | | — | 1 |
| 2 | cost-routed | | — | 1 |
| 3 | local-only | | — | 1 |

| Rank | Lane | Quality | Throughput | Runs |
|---|---|---|---|---|
| 1 | cyber | | — | 1 |
| 2 | finance | | — | 1 |
| 3 | medical | | — | 1 |
| 4 | brain | | — | 1 |
| 5 | legal | | — | 1 |
| 6 | patent | | — | 1 |

| Rank | Lane | Quality | Throughput | Runs |
|---|---|---|---|---|
| 1 | qwen3-30b-moe-llamacpp-q4km | | 83.5 tok/s | 1 |
| 2 | qwen3-30b-moe-vllm-fp8 | | 55.0 tok/s | 1 |
| 3 | nim-incumbent | | 23.9 tok/s | 1 |
◉ Live cockpit runs — operator compares
| Rank | Rubric · Lane | Quality | Throughput | TTFT | Runs | Human ↑ |
|---|---|---|---|---|---|---|
| 1 | patent_claim_validity vs · openrouter-frontier | | 27.3 tok/s | 3179 ms | 2 | — |
| 2 | patent_claim_validity vs · resident-brain | | 88.1 tok/s | 100 ms | 2 | — |
Source — fieldkit.arena.mirror.export_publishable_slice(); allowlist pinned by fieldkit/tests/arena/demo/test_mirror_does_not_leak.py.
The chat_* tables, compare_runs.prompt, and compare_responses.{content,reasoning} are NEVER enumerated.
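To make the allowlist idea concrete, here is a minimal sketch of an exporter that names the publishable columns explicitly, so anything not listed is never read. It is not the real `export_publishable_slice()` implementation. The schema, the `ALLOWLIST` constant, the `export_slice` and `demo_db` helpers, and the `chat_messages` table (standing in for the `chat_*` family) are assumptions for illustration.

```python
import sqlite3

# Hypothetical allowlist: table -> columns that may be published.
# Unlisted tables and columns are never queried.
ALLOWLIST = {
    "compare_runs": ("id", "lane", "tok_per_s"),
}


def export_slice(conn, allowlist=ALLOWLIST):
    """Return {table: [row dicts]} containing only allowlisted columns."""
    out = {}
    for table, cols in allowlist.items():
        # Identifiers come from the hard-coded allowlist, not user input.
        col_sql = ", ".join(f'"{c}"' for c in cols)
        cur = conn.execute(f'SELECT {col_sql} FROM "{table}" ORDER BY 1')
        out[table] = [dict(zip(cols, row)) for row in cur.fetchall()]
    return out


def demo_db():
    """Build an in-memory database with one private column and one private table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE compare_runs (
            id INTEGER PRIMARY KEY, lane TEXT, prompt TEXT, tok_per_s REAL
        );
        CREATE TABLE chat_messages (id INTEGER PRIMARY KEY, body TEXT);
        INSERT INTO compare_runs VALUES (1, 'resident-brain', 'PRIVATE PROMPT', 88.1);
        INSERT INTO chat_messages VALUES (1, 'PRIVATE CHAT');
        """
    )
    return conn


published = export_slice(demo_db())
# A leak test in the same spirit: private text must not appear anywhere.
assert "PRIVATE" not in repr(published)
print(published)
```

Selecting named columns instead of using `SELECT *` with a denylist means a newly added private column stays unpublished by default until someone adds it to the allowlist.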