Orionfold Arena — Leaderboard

Benches: 4 (cached evidence sources)

Lanes ranked: 36 (unique (lane, bench))

Runs: 78 (bench + live)

Schema: v2 (leak-proof ✓)

Generated: 2026-06-11 10:56:17 UTC (last fieldkit arena mirror)

Cost / quality efficiency frontier | 7 models · 30 builds

Every quant variant the Spark has measured, plotted as quality × throughput. The gold line is the Pareto frontier: the builds that no other build beats on both axes at once. Public cloud arenas cannot draw this frontier, because they do not know what hardware their votes ran on. We do: the operator is the hardware.
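A minimal sketch of how such a frontier can be computed, assuming both axes are "higher is better". The `Build` type and `pareto_frontier` function are hypothetical names for illustration, not part of fieldkit. The sample points reuse the three hermes_brain lanes from this page only as convenient numbers; the frontier plot itself uses a per-model quality index.

```python
from typing import NamedTuple


class Build(NamedTuple):
    name: str
    quality: float      # higher is better
    throughput: float   # tok/s, higher is better


def pareto_frontier(builds):
    """Return the builds that no other build beats on both axes at once."""
    # Walk from fastest to slowest; on a throughput tie, see the higher quality first.
    ordered = sorted(builds, key=lambda b: (-b.throughput, -b.quality))
    frontier = []
    best_quality = float("-inf")
    for b in ordered:
        # A build stays only if it beats the quality of every build that is at least as fast.
        if b.quality > best_quality:
            frontier.append(b)
            best_quality = b.quality
    return frontier


# Sample points taken from the hermes_brain table (illustration only).
lanes = [
    Build("qwen3-30b-moe-llamacpp-q4km", 90.0, 83.5),
    Build("qwen3-30b-moe-vllm-fp8", 87.5, 55.0),
    Build("nim-incumbent", 77.5, 23.9),
]
frontier_names = [b.name for b in pareto_frontier(lanes)]
```

With these three points the first lane is both the fastest and the highest quality, so it is the only point on the frontier. A frontier with several points appears only when faster builds give up some quality.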

Quality index is normalized per model, because perplexity is corpus-dependent and only comparable within one base model. Each model's variants form their own curve; hover any point for the raw numbers. Per-model detail lives under Models.
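The page does not state the normalization formula. As a sketch under one plausible assumption (index = best perplexity within the model divided by the variant's perplexity, so the best variant of each model scores 1.0), per-model normalization could look like this. The function name, the model names, and the perplexity values are all hypothetical.

```python
from collections import defaultdict


def quality_index(rows):
    """rows: iterable of (model, variant, perplexity).

    Returns {(model, variant): index}, where the index is only comparable
    between variants of the same model. ASSUMPTION: index = best_ppl / ppl.
    """
    rows = list(rows)
    best = defaultdict(lambda: float("inf"))
    for model, _variant, ppl in rows:
        # Lowest perplexity seen for this base model.
        best[model] = min(best[model], ppl)
    return {(m, v): best[m] / ppl for m, v, ppl in rows}


# Hypothetical measurements: two base models, scored on different corpora.
idx = quality_index([
    ("model-a", "F16", 6.0),
    ("model-a", "Q4_K_M", 6.3),
    ("model-b", "F16", 12.0),
    ("model-b", "Q4_K_M", 12.5),
])
```

Here model-b's raw perplexities are about twice model-a's, yet both F16 variants score 1.0. That is the point of normalizing within a model: raw perplexity across base models says nothing about relative quality.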

⬢ Bench-anchored cached tier · 2026-06-11 10:56:17 UTC

Orionfold Advisor — refusal-floor contract | the-refusal-floor-is-trainable:advisor_contract | 6 lanes · 6 runs | metric · frozen OOD curveballs

Rank	Lane	Tags	Bench	Lane ID	Quality	Throughput	Runs
1	Advisor 4B — trained (SFT v0.2)	◆ flagship, promoted lane	frozen OOD · curveball v0.1	4b-sft-v0.2::curveball-v0.1::the-refusal-floor-is-trainable	90.0%	42.0 tok/s	1
2	Advisor 4B — trained (SFT v0.2)	◆ flagship, promoted lane	frozen OOD · curveball v0.2	4b-sft-v0.2::curveball-v0.2::the-refusal-floor-is-trainable	85.7%	42.0 tok/s	1
3	Advisor 4B — trained (SFT v0.1)	superseded	frozen OOD · curveball v0.1	4b-sft-v0.1::curveball-v0.1::the-refusal-floor-is-trainable	70.0%	—	1
4	Nemotron 30B — teacher · prompt-only	teacher	frozen OOD · curveball v0.1	30b-prompted::curveball-v0.1::the-refusal-floor-is-trainable	57.5%	—	1
5	Nemotron 4B — untrained base	baseline	frozen OOD · curveball v0.1	4b-init::curveball-v0.1::the-refusal-floor-is-trainable	55.0%	—	1
6	Nemotron 30B — teacher · prompt-only	teacher	frozen OOD · curveball v0.2	30b-prompted::curveball-v0.2::the-refusal-floor-is-trainable	38.1%	—	1

hermes-cost-routing-local-and-openrouter:cost_router | 3 lanes · 3 runs | metric · cost_router

Rank	Lane	Quality	Throughput	Runs
1	frontier-only	100.0%	—	1
2	cost-routed	91.7%	—	1
3	local-only	66.7%	—	1

hermes-vertical-router-on-spark:vertical_router | 6 lanes · 6 runs | metric · vertical_router

Rank	Lane	Quality	Throughput	Runs
1	cyber	100.0%	—	1
2	finance	100.0%	—	1
3	medical	100.0%	—	1
4	brain	80.0%	—	1
5	legal	80.0%	—	1
6	patent	80.0%	—	1

picking-the-hermes-brain-on-spark:hermes_brain | 3 lanes · 3 runs | metric · hermes_brain

Rank	Lane	Quality	Throughput	Runs
1	qwen3-30b-moe-llamacpp-q4km	90.0%	83.5 tok/s	1
2	qwen3-30b-moe-vllm-fp8	87.5%	55.0 tok/s	1
3	nim-incumbent	77.5%	23.9 tok/s	1

◉ Live cockpit runs — operator compares & chats · static snapshot

cockpit · all rubrics | 21 rows · 60 runs | metric · rubric mean

Rank	Model	Provider	Rubric	Quality	Throughput	TTFT	$/task	$/quality	Runs	Human ↑
1	anthropic/claude-opus-4.8-fast	OpenRouter	patent_claim_validity	100.0% ·fmt	158.0 tok/s	1262 ms	—	—	1	—
2	anthropic/claude-haiku-4.5	OpenRouter	generic-correctness	100.0% ·fmt	111.4 tok/s	1112 ms	$0.0013	$0.0013/pt	2	—
3	nvidia/nemotron-nano-9b-v2	OpenRouter	patent_claim_validity	100.0% ·fmt	99.0 tok/s	3588 ms	—	—	3	—
4	securityllm-gguf (Q4_K_M)	Spark GPU	generic-correctness	100.0% ·fmt	48.9 tok/s	82 ms	—	—	1	—
5	ii-medical-8b-gguf (Q4_K_M)	Spark GPU	generic-correctness	100.0% ·fmt	44.5 tok/s	67 ms	—	—	1	—
6	patent-strategist-v3-nemo-gguf (Q4_K_M)	Spark GPU	patent_claim_validity	100.0% ·fmt	41.1 tok/s	152 ms	—	—	4	—
7	discovered:8091	Spark GPU	generic-correctness	100.0% ·fmt	28.8 tok/s	450 ms	$0	$0 (local)	4	—
8	frontier	OpenRouter	patent_claim_validity	100.0% ·fmt	27.3 tok/s	3179 ms	—	—	2	—
9	qwen/qwen3-8b	OpenRouter	generic-correctness	100.0% ·fmt	24.6 tok/s	157503 ms	$0.0001	$0.0001/pt	4	—
10	finance-chat-gguf (F16)	Spark GPU	generic-correctness	100.0% ·fmt	18.9 tok/s	172 ms	—	—	2	—
11	finance-chat-gguf (Q5_K_M)	Spark GPU	generic-correctness	100.0% ·fmt	16.1 tok/s	1040 ms	—	—	4	—
12	kepler (Q8_0)	Spark GPU	generic-correctness	100.0% ·fmt	8.6 tok/s	241 ms	$0	$0 (local)	4	—
13	openai/gpt-5.5-pro	OpenRouter	generic-correctness	100.0% ·fmt	8.1 tok/s	21464 ms	$0.015	$0.0149/pt	7	—
14	frontier	OpenRouter	generic-correctness	100.0% ·fmt	—	0 ms	—	—	1	—
15	resident-brain	Spark GPU	generic-correctness	85.7% ·fmt	97.9 tok/s	134 ms	—	—	7	—
16	openai/gpt-5.5-pro	OpenRouter	patent_claim_validity	50.0% ·fmt	26437.0 tok/s	84263 ms	—	—	2	—
17	resident-brain	Spark GPU	patent_claim_validity	42.9% ·fmt	89.3 tok/s	140 ms	—	—	7	—
18	stepfun/step-3.7-flash	OpenRouter	patent_claim_validity	0.0% ·fmt	239.6 tok/s	6530 ms	—	—	1	—
19	openai/gpt-4o-mini	OpenRouter	patent_claim_validity	0.0% ·fmt	200.5 tok/s	792 ms	—	—	1	—
20	saul-7b-instruct-v1-gguf (Q4_K_M)	Spark GPU	patent_claim_validity	0.0% ·fmt	46.5 tok/s	62 ms	—	—	1	—
21	deepseek/deepseek-r1-0528	OpenRouter	generic-correctness	0.0% ·fmt	—	—	$0.0000	—	1	—

Source — fieldkit.arena.mirror.export_publishable_slice(); allowlist pinned by fieldkit/tests/arena/demo/test_mirror_does_not_leak.py. The chat_* tables, compare_runs.prompt, and compare_responses.{content,reasoning} are NEVER enumerated.
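The allowlist idea in the note above can be sketched as follows. This is not the fieldkit implementation: `ALLOWLIST`, `export_slice`, the schema, and every column name other than `prompt`, `content`, and `reasoning` are hypothetical, and SQLite is assumed only to keep the example self-contained. The point it illustrates is that an export which selects named columns from named tables never touches anything it does not list.

```python
import sqlite3

# HYPOTHETICAL allowlist: table -> columns that may be published.
ALLOWLIST = {
    "compare_runs": ["id", "model", "rubric"],
    "compare_responses": ["run_id", "tokens_per_s", "ttft_ms"],
}


def export_slice(conn):
    """Read only allowlisted columns; unlisted tables and columns are never queried."""
    out = {}
    for table, cols in ALLOWLIST.items():
        col_sql = ", ".join(f'"{c}"' for c in cols)
        cur = conn.execute(f'SELECT {col_sql} FROM "{table}"')
        out[table] = [dict(zip(cols, row)) for row in cur.fetchall()]
    return out


# Tiny in-memory database with sensitive columns next to publishable ones.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compare_runs (id INTEGER, model TEXT, rubric TEXT, prompt TEXT);
    CREATE TABLE compare_responses (
        run_id INTEGER, tokens_per_s REAL, ttft_ms REAL, content TEXT, reasoning TEXT
    );
    CREATE TABLE chat_messages (body TEXT);
    INSERT INTO compare_runs VALUES (1, 'resident-brain', 'generic-correctness', 'private prompt');
    INSERT INTO compare_responses VALUES (1, 97.9, 134.0, 'private answer', 'private reasoning');
    INSERT INTO chat_messages VALUES ('private chat');
""")
sliced = export_slice(conn)
```

A leak test in the spirit of the one named above could then assert that no `chat_*` table and none of the `prompt`, `content`, or `reasoning` keys appear anywhere in the exported slice.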