fieldkit · module

fieldkit.arena

Operator cockpit for the DGX Spark — FastAPI sidecar on 127.0.0.1:7866 with SSE telemetry/chat/compare streams, a SQLite-backed `~/.fieldkit/arena.db`, and a static-mirror exporter that publishes a leak-proof leaderboard slice to `ainative.business/arena/`. Sibling to `fieldkit.harness` (Hermes = agent harness; Arena = operator harness). M2 ships the SQLite store + retroactive importer; M3 ships the FastAPI app + telemetry SSE; M4 ships the chat island; M5 ships side-by-side compare; **M6 (this release) ships the leak-proof mirror exporter** (`fieldkit.arena.mirror.export_publishable_slice` with hardcoded allowlist; regression test pins zero chat-content leaks); M7 lands the launch article + Mac sync per `_SPECS/spark-arena-v1.md`.

What it is

The Harnesses arc taught the project to publish agent harnesses — Hermes drives Spark, fieldkit-as-MCP keystone, vertical + cost routers. fieldkit.arena is the operator counterpart: the cockpit a solo Spark builder uses to drive every artifact the rest of the package has shipped. Six months of work has accreted 49 articles, 17 manifests under src/content/artifacts/, 13 HF repos under the Orionfold/ namespace, and a 950-test fieldkit substrate — none of it had a single surface to drive it from until now. The cockpit lives at http://127.0.0.1:7866/arena/ (loopback only) with a static slice mirrored to ainative.business/arena/. Per feedback_llm_skill_pattern the module is deterministic Python only — all LLM generation (rubric prompts, prose) stays in session-driven skills.

The full design is in _SPECS/spark-arena-v1.md. M2 ships the SQLite store + the retroactive importer; M3–M7 fill the substantive sidecar surface (see [Unreleased] in CHANGELOG.md).

Status: v0.2 product leap (Orionfold Arena). Builds on the M1–M6 sidecar with six showcase surfaces: a Models/capabilities browser, the cost/quality efficiency frontier, Compare markdown+winner+delta parity, a ⌘K command palette, the telemetry↔article-evidence bridge, and the Lab co-iteration board (/arena/lab/) with an operator-private lab_notes annotation layer (GET/POST/DELETE /api/lab/notes; on FORBIDDEN_TABLES). Distribution: the runnable cockpit now ships inside the fieldkit wheel: pip install fieldkit[arena] → fieldkit arena up → http://127.0.0.1:7866/arena/. It is baked by fieldkit arena build and served via a StaticFiles mount. The leak gate fieldkit/tests/arena/test_mirror_does_not_leak.py still pins zero operator-private leaks (chat + lab_notes). The full breakdown lives in HANDOFF.md’s 🏟️ ARENA TRACK section.

Public API (today — M6)

from fieldkit.arena import (
    # version pin
    ARENA_SURFACE_VERSION,
    # constants — operator-visible, frozen at spec §3.4
    DEFAULT_ARENA_PORT,   # 7866
    DEFAULT_ARENA_DB,     # "~/.fieldkit/arena.db"
    # errors (hierarchy stable from day one)
    ArenaError,
    LaneNotRegistered,
    PublishableSliceEmpty,
    # M2 — synchronous SQLite store + the retroactive-load surface
    ArenaStore,
    ImportReport,
    import_artifacts,
    # M2 — row records (the importer constructs these; the store persists them)
    LaneRecord,
    BenchResultRow,
    ArticleIndexRow,
    HfMetaRow,
    NotebookExportRow,
    LeaderboardRow,
    # M3 — FastAPI sidecar (lazy: import is stdlib-cheap; calling
    # `create_app()` pulls FastAPI + sse-starlette behind the `arena` extra)
    create_app,
    serve,
    TelemetryHub,
    # M4 — chat session + turn records (operator-private; never mirrored)
    ChatSessionRecord,
    ChatTurnRecord,
    # M5 — compare / rubric-score / human-pref records + the default
    # rubric registry the side-by-side compare scores against
    CompareRunRecord,
    CompareResponseRecord,
    RubricScoreRecord,
    HumanPrefRecord,
    RubricSpec,
    DEFAULT_RUBRIC_REGISTRY,
    default_rubric_for_prompt,
    # M6 — leak-proof public mirror exporter. Hardcoded allowlist guard;
    # chat_* tables NEVER enumerated. The regression test
    # fieldkit/tests/arena/test_mirror_does_not_leak.py pins zero leaks
    # against random-UUID sentinels (the M7-blocker gate).
    export_publishable_slice,
    rebuild_leaderboard,
    ExportReport,
    RebuildReport,
    MIRROR_SCHEMA_VERSION,
    PUBLISHABLE_TABLES,
    FORBIDDEN_TABLES,
    FORBIDDEN_COLUMNS,
    # M8 — control-plane queue (operator-private; never mirrored). The job
    # records, the dispatcher (executes through the fieldkit.harness MCP
    # surface), and the leaderboard-regression trigger producer. See §12.
    JobRecord,
    JobTriggerRecord,
    JobKind,
    JobStatus,
    enqueue_job,
    dispatch_job,
    drain_jobs,
    detect_leaderboard_regression,
    enqueue_regressions,
    JobDispatchError,
    UnknownJobKind,
)

M3 — create_app() + the sidecar endpoints

The FastAPI app factory. Lazy on FastAPI / sse-starlette / uvicorn imports, so import fieldkit.arena.server is stdlib-cheap and the failure mode without the arena extra installed is a clear RuntimeError pointing the operator at pip install 'fieldkit[arena]'.

from fieldkit.arena import create_app
app = create_app(repo_root="/home/nvidia/ainative-business.github.io", telemetry_interval=0.5)
# Mount under uvicorn, or use fieldkit.arena.serve(...)

| Kwarg | Default | What it does |
| --- | --- | --- |
| db | ~/.fieldkit/arena.db | Operator-private SQLite path. Created lazily by ArenaStore.initialize() on first read. |
| repo_root | Path.cwd() | Source-of-truth for the static mirror JSON (src/data/arena-mirror/leaderboard.json). Pass explicitly when running the sidecar from a different cwd. |
| telemetry_interval | 0.5 (spec §4.6) | Seconds between SSE telemetry ticks while a subscriber is open. Set lower for unit smokes; the spec’s hardware-shape claim is 500 ms. |
| cors_origins | dev set (:4321 + localhost + Spark LAN IP) | Astro dev page at :4321 needs CORS to reach the sidecar at :7866. Production mirror has no live fetches. |
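
The lazy-import behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: `require_extra` and `create_app_sketch` are hypothetical names, and the exact error wording is an assumption (the text only says the RuntimeError points at the pip command).

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str = "arena") -> ModuleType:
    """Import an optional dependency at call time, not at module import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical wording; only the pointer to the pip command is documented.
        raise RuntimeError(
            f"{module_name!r} is not installed; run: pip install 'fieldkit[{extra}]'"
        ) from exc


def create_app_sketch():
    # The heavy import happens inside the factory, so importing the
    # surrounding module stays cheap when the extra is absent.
    fastapi = require_extra("fastapi")
    return fastapi.FastAPI()
```

Deferring the import to the factory call keeps `import` of the package free of third-party requirements while still failing loudly, with an actionable message, when the operator tries to start the sidecar.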

Endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /healthz | Liveness + ARENA_SURFACE_VERSION + subscribers count + telemetry_running flag. |
| GET | /api/lanes | Live read: resident brain from ~/.hermes/config.yaml (re-read every request per Risk R8) + roster from the M2 lanes table (empty if the store doesn’t exist yet). |
| GET | /api/leaderboard?limit=N | Proxies the static mirror JSON. M5 will rebuild this from compare_runs / rubric_scores; M3 reads what M2 seeded. |
| GET | /api/telemetry/stream | SSE: one telemetry event per telemetry_interval while subscribed; payload shape per spec §4.6 (ts, gpu_util, gpu_temp_c, unified_used_gb, unified_total_gb, inflight, tok_per_s, ttft_ms, lane_id). Yields a hello payload immediately on subscribe so the gauge paints without waiting for the first sampler interval. |

M3 — serve(host, port, ...)

Thin uvicorn launcher. Loopback-only by default (no auth in v0.1 per spec §3.1 #4); pass --reload to use uvicorn’s source-watcher (works because the launcher swaps to import-string mode and reads ARENA_DB / ARENA_REPO_ROOT env vars to reconstruct create_app(...) inside the worker process).

fieldkit arena serve --port 7866 --repo-root /home/nvidia/ainative-business.github.io

| Kwarg | Default | What it does |
| --- | --- | --- |
| host | "127.0.0.1" | Loopback by default; pass 0.0.0.0 to expose on the LAN (no auth in v0.1 per spec §3.1 #4). |
| port | DEFAULT_ARENA_PORT (7866) | Locked by spec §3.4. |
| db | ~/.fieldkit/arena.db | Operator-private SQLite path. |
| repo_root | cwd | Where to find src/data/arena-mirror/leaderboard.json. |
| reload | False | uvicorn --reload; swaps the launcher to an import-string and threads ARENA_DB / ARENA_REPO_ROOT env vars into the worker process. |
| log_level | "info" | uvicorn log level ("debug" is useful during M3 SSE bring-up). |

M3 — TelemetryHub

Reference-counted wrapper around fieldkit.harness.Telemetry. The first SSE subscriber attaches the underlying sampler; the last one to disconnect stops it (spec §4.6 zero-idle commitment). Layered on top of Telemetry rather than modifying it — keeps the harness module stable.

| Attribute / Method | What it does |
| --- | --- |
| TelemetryHub(interval=0.5) | Construct; sampler starts on first subscribe(). |
| .subscribe(loop) | Returns (asyncio.Queue, unsubscribe_callable). First subscriber starts the sampler; unsubscribe() is idempotent. |
| .subscriber_count | Current attached subscriber count (thread-safe). |
| .is_running | True between the first subscribe and the last unsubscribe. |
| .report_inflight(inflight, tok_per_s, ttft_ms, lane_id) | M4+ stream callers tag the active lane + speeds; idle ticks read these as sticky defaults until refreshed. |
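
The reference-counting contract in the table (first subscriber starts the sampler, last one stops it, unsubscribe is idempotent) can be sketched as follows. `RefCountedSampler` is a hypothetical stand-in, not TelemetryHub itself: it takes plain start/stop callables instead of wrapping fieldkit.harness.Telemetry, and its subscribe() returns only the unsubscribe callable rather than the (queue, callable) pair.

```python
import threading


class RefCountedSampler:
    """Start a resource on the first subscribe, stop it on the last unsubscribe."""

    def __init__(self, start, stop):
        self._start = start
        self._stop = stop
        self._lock = threading.Lock()
        self._count = 0

    @property
    def subscriber_count(self) -> int:
        with self._lock:
            return self._count

    @property
    def is_running(self) -> bool:
        return self.subscriber_count > 0

    def subscribe(self):
        with self._lock:
            self._count += 1
            if self._count == 1:
                self._start()  # first subscriber attaches the sampler
        released = False

        def unsubscribe() -> None:
            nonlocal released
            with self._lock:
                if released:
                    return  # idempotent: a second call is a no-op
                released = True
                self._count -= 1
                if self._count == 0:
                    self._stop()  # last subscriber detaches the sampler

        return unsubscribe


events = []
hub = RefCountedSampler(lambda: events.append("start"), lambda: events.append("stop"))
first = hub.subscribe()
second = hub.subscribe()
first()
first()  # repeated call must not decrement twice
running_with_one_left = hub.is_running
second()
```

Guarding the release flag per subscription, rather than only checking the shared counter, is what keeps a double unsubscribe from stopping the sampler while another client is still attached.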

M3 — telemetry_event_stream(hub, request)

Async generator powering /api/telemetry/stream. Extracted out of the route closure so unit tests can drive it directly against an in-memory queue (no FastAPI / sse-starlette round-trip needed) — see fieldkit/tests/arena/test_server.py.
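
As an illustration of why extracting the generator helps testing, here is a minimal sketch of an async generator that yields a hello payload first and then drains an in-memory queue. The name `event_stream_sketch`, the `max_events` bound, and the payload shapes are assumptions made for the example; the real generator presumably runs until the request disconnects.

```python
import asyncio


async def event_stream_sketch(queue: asyncio.Queue, hello: dict, max_events: int):
    # Emit the hello payload immediately so a client can paint before the
    # first sampler tick arrives.
    yield hello
    # Bounded here for the demo; a live stream would loop until disconnect.
    for _ in range(max_events):
        yield await queue.get()


async def _demo():
    queue = asyncio.Queue()
    for tick in range(2):
        queue.put_nowait({"ts": tick, "gpu_util": 0.0})
    stream = event_stream_sketch(queue, {"hello": True}, max_events=2)
    return [event async for event in stream]


received = asyncio.run(_demo())
```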

M4 — POST /api/chat/stream

Single-lane chat against the resident brain. Resolves the lane from ~/.hermes/config.yaml on every request (Risk R8: operator config edits take effect on the next turn). Streams via SSE; persists user + assistant turns to chat_sessions + chat_turns (operator-private; the M6 mirror exporter’s hardcoded allowlist keeps the chat_* tables OUT of its enumeration).

Request body (Pydantic):

| Field | Default | Notes |
| --- | --- | --- |
| prompt | required | 1–32 000 chars. The user’s turn for this round. |
| session_id | None | Omit on the first turn; the server allocates one + echoes it in the start event. Thread it back on subsequent turns to keep the multi-turn history in the same session. |
| rubric_id | None | Reserved for M5 score-on-completion (rubric_scores); ignored at M4. |
| max_tokens | 4096 | Per-turn budget. Qwen3-30B-A3B with --reasoning-format none emits a long <think> chain, so truncating loses the answer. |
| temperature | 0.0 | Deterministic by default, since the cockpit is for measured comparisons. |

Emitted SSE events:

  • start: {session_id, lane_id, model, base_url}. Painted by the client as the conversation header / lane chip.
  • token: {channel: "reasoning"|"content", text: "..."}. One per upstream chunk; the channel classifier flips at the <think> / </think> boundary, so the UI collapses the reasoning into a <details> block by default.
  • done: {session_id, turn_id, ttft_ms, tok_per_s, tokens_out, wall_s, finish_reason}. The client pins the perf metadata to the assistant card footer.
  • error: {detail} (on upstream stream errors). The partial reply is still persisted with finish_reason="error" so the operator can inspect it afterwards.
  • heartbeat: {} (sse-starlette’s keepalive on a >60s idle).
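
The channel classifier behind the token events can be sketched as a small state machine. This is an illustration under a simplifying assumption: each <think> or </think> tag arrives whole inside one upstream chunk (a tag split across two chunks is not handled). `classify_chunks` is a hypothetical name.

```python
def classify_chunks(chunks):
    """Yield {channel, text} dicts, flipping channel at <think> / </think>."""
    channel = "content"
    for chunk in chunks:
        rest = chunk
        while rest:
            # Look for whichever tag would end the current channel.
            tag = "</think>" if channel == "reasoning" else "<think>"
            before, found, after = rest.partition(tag)
            if before:
                yield {"channel": channel, "text": before}
            if not found:
                break
            channel = "content" if channel == "reasoning" else "reasoning"
            rest = after


tokens = list(classify_chunks(["<think>plan", " more</think>The answer", " is 4"]))
```

The tags themselves are dropped from the emitted text, so a client can render the reasoning channel in a collapsed block without further parsing.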

Returns 503 if ~/.hermes/config.yaml doesn’t carry a usable model.base_url. The route wires TelemetryHub.report_inflight(inflight=True, tok_per_s=..., ttft_ms=..., lane_id=...) on stream start + ~every 16 tokens, then inflight=False on done — that’s the visible M3↔M4 round-trip the spec §4.2 review validates.

M4 — chat_event_stream(*, hub, request, body, resident, db_path)

Async generator powering POST /api/chat/stream. Extracted out of the route closure so unit tests can drive it directly against a stub of OpenAICompatClient (no live llama-server needed) — see test_chat_event_stream_emits_start_token_done + the two persistence + telemetry-wiring tests in fieldkit/tests/arena/test_server.py.

M4 — ChatSessionRecord / ChatTurnRecord

Stdlib dataclasses mirroring the spec §4.8 chat_sessions + chat_turns columns. Both default publishable=0 (operator-private). The M6 mirror exporter’s table allowlist hardcodes the chat_* tables out, so even rows the operator opts to promote (a future v0.2 workflow) never leak through the bulk mirror path.

| Record | Table | Key columns |
| --- | --- | --- |
| ChatSessionRecord | chat_sessions | id (FK target for chat_turns.session_id) |
| ChatTurnRecord | chat_turns | (session_id, ord) UNIQUE; append-only |

ArenaStore exposes four helpers for these:

| Method | Returns | Notes |
| --- | --- | --- |
| .upsert_chat_session(row) | None | INSERT OR REPLACE INTO chat_sessions …; row may be a mapping or ChatSessionRecord. |
| .append_chat_turn(row) | int | Strict INSERT INTO chat_turns … (the (session_id, ord) UNIQUE is meaningful: a duplicate ord is a programming error). Returns the rowid. |
| .chat_session(session_id) | sqlite3.Row \| None | Lookup by id. |
| .chat_turns(session_id) | list[sqlite3.Row] | All turns for a session, ordered by ord. |
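
To make the strict-INSERT contract concrete, the following sketch uses an in-memory SQLite table with the same (session_id, ord) uniqueness. The table here is a cut-down stand-in: the `role` column and the `append_turn` helper are assumptions for the example, not the spec §4.8 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chat_turns ("
    " id INTEGER PRIMARY KEY,"
    " session_id TEXT NOT NULL,"
    " ord INTEGER NOT NULL,"
    " role TEXT NOT NULL,"  # hypothetical column for the demo
    " UNIQUE (session_id, ord))"
)


def append_turn(conn, session_id, ord_, role):
    # Plain INSERT (no OR REPLACE): a duplicate ord raises instead of
    # silently overwriting an earlier turn.
    cur = conn.execute(
        "INSERT INTO chat_turns (session_id, ord, role) VALUES (?, ?, ?)",
        (session_id, ord_, role),
    )
    return cur.lastrowid


first_id = append_turn(conn, "s1", 0, "user")
second_id = append_turn(conn, "s1", 1, "assistant")
try:
    append_turn(conn, "s1", 1, "assistant")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```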

M5 — POST /api/compare/stream

Side-by-side rubric-scored compare against the resident brain (lane A — always; the single-brain envelope per [[project_spark_unified_memory_oom]] and spec §4.9 doesn’t allow two warm local lanes in v0.1) and a configurable B-lane. Default B is the OpenRouter frontier tier reached via the H6 CostRouterConfig (snapshot prices in the H6 article evidence; no concurrent local warm — safe by construction). Explicit two-local-lanes mode (lane_b="local:<id>") emits a structured error in v0.1 (code: "two_local_lanes_v0_2_only") so the UI can show the v0.2 affordance.

Body (Pydantic):

| Field | Default | Notes |
| --- | --- | --- |
| prompt | required | 1–32 000 chars. |
| lane_b | "openrouter" | "openrouter" (default, H6 frontier tier) or "local:<lane_id>" (v0.2). |
| rubric_id | None | Server picks from default_rubric_for_prompt when absent: patent prompts → patent_claim_validity, MCQ → mcq_letter, free-form → generic-correctness. |
| max_tokens | 4096 | Per-side token budget. |
| temperature | 0.0 | Deterministic by default. |

Emitted SSE events (spec §4.3 event sequence):

  • start_a: {run_id, side: "A", lane_id, model, base_url, rubric_id}. Painted as the A-column header; run_id threads back into the eventual POST /api/prefs call.
  • token_a: {channel: "reasoning"|"content", text}. Channel classifier flips at the <think> / </think> boundary.
  • done_a: {ttft_ms, tok_per_s, tokens_out, wall_s, finish_reason}. Pinned to the A-column footer.
  • start_b: {side: "B", lane_id, model, base_url, no_key?}. no_key: true flags the OpenRouter-key-missing stub path so the UI can show an actionable “set OPENROUTER_API_KEY” message.
  • token_b / done_b: same shape as A.
  • score: {run_id, rubric_id, a: {total, checks: [{name, kind, ok, why}]}, b: {…}, deltas: {score, speed_tok_per_s}}. Per-check ok + why strings paint under each side.
  • error: {detail, code?, side?}. The two_local_lanes_v0_2_only code is the v0.1 advisory.

Returns 503 if ~/.hermes/config.yaml doesn’t carry a usable model.base_url. The route wires TelemetryHub.report_inflight(inflight=True, ...) on each side’s stream start + ~every 16 tokens, then inflight=False on score. Persistence: one compare_runs header row (publishable=1), two compare_responses rows (one per side), two rubric_scores rows (per side), all under the same run_id.

M5 — compare_event_stream(*, hub, request, body, resident, db_path)

Async generator powering POST /api/compare/stream. Extracted out of the route closure so unit tests drive it directly against stub clients (no live llama-server or OpenRouter needed) — see test_compare_event_stream_emits_full_sse_sequence + the persistence + thumbs-no-mutation + stub-no-key + two-local-lanes-v0.2 tests in fieldkit/tests/arena/test_server.py.

M5 — GET /api/rubrics

Returns the default rubric registry — three deterministic rubrics ship with v0.1. Each entry carries an id, title, description, and a flat list of check kinds so the picker dropdown can render the right column shape under each side.

| Id | Title | Check kind | What it asserts |
| --- | --- | --- | --- |
| generic-correctness | Generic correctness | regex | Answer is non-empty (alphanumeric token present). The floor rubric for free-form prompts. |
| patent_claim_validity | Patent claim validity | substring | Any of: anticipation, obviousness, written description, enablement, § 102 / § 103 / § 112, 35 U.S.C.. The patent-strategist canonical rubric. |
| mcq_letter | MCQ letter (A/B/C/D) | regex | Bare A/B/C/D, word-boundary, case-insensitive. The cyber-bench canonical rubric. |

Operator-supplied rubrics layer on top via ~/.fieldkit/arena/rubrics/ at M6+ (a directory walk loaded at sidecar boot); the default list is always the head.
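
The three default checks can be approximated with plain regular expressions and substring tests. This is a sketch of the described behaviour, not the fieldkit.eval.Rubric implementation: the function names are hypothetical, the exact patterns are guesses from the table, and treating the patent substring match as case-insensitive is an assumption.

```python
import re

PATENT_TERMS = (
    "anticipation", "obviousness", "written description", "enablement",
    "§ 102", "§ 103", "§ 112", "35 U.S.C.",
)


def check_generic(answer: str) -> bool:
    # Non-empty in the sense of "contains at least one alphanumeric character".
    return re.search(r"[A-Za-z0-9]", answer) is not None


def check_patent(answer: str) -> bool:
    # Any-of substring match; case-insensitivity is an assumption here.
    lowered = answer.lower()
    return any(term.lower() in lowered for term in PATENT_TERMS)


def check_mcq(answer: str) -> bool:
    # A bare A/B/C/D delimited by word boundaries, case-insensitive.
    return re.search(r"\b[ABCD]\b", answer, re.IGNORECASE) is not None
```

One consequence of the word-boundary, case-insensitive pattern is that the English article "a" also satisfies it, so a check like this is only meaningful on answers constrained to a letter choice.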

M5 — POST /api/prefs

Records one operator thumbs verdict on a compare run. Separate signal per spec §4.3 — writes a human_prefs row but does NOT mutate the corresponding rubric_scores.total. The leaderboard (M6) surfaces this as human_pref_winrate only at ≥5 prefs per lane.

| Field | Notes |
| --- | --- |
| compare_run_id | The id from the start_a event. 404 on unknown. |
| winner | "A", "B", or "tie" (Pydantic-validated). |
| note | Optional free text (≤2000 chars). |

Returns {ok, pref_id, compare_run_id, n_prefs} — the count lets the picker UX lock further clicks once the operator has voted.

M5 — RubricSpec / DEFAULT_RUBRIC_REGISTRY / default_rubric_for_prompt

Frozen Python data, with no YAML round-trip at runtime. RubricSpec is the registry entry (id + title + description + executable fieldkit.eval.Rubric); DEFAULT_RUBRIC_REGISTRY is the 3-entry built-in dict; default_rubric_for_prompt(prompt) is a substring-sweep picker (patent triggers → patent_claim_validity; (a) / (b) / (c) / (d) / multiple choice → mcq_letter; otherwise generic-correctness).

| Helper | Returns | Notes |
| --- | --- | --- |
| list_rubrics(registry=None) | list[dict] | JSON-safe shape for GET /api/rubrics. |
| get_rubric(id, *, registry=None) | RubricSpec \| None | Lookup, or None (the compare path falls through to generic-correctness rather than raising). |
| default_rubric_for_prompt(prompt) | str | The spec §4.3 picker. Pure function, no I/O. |
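
A substring-sweep picker of this kind might look like the sketch below. The MCQ triggers come from the description above; the patent trigger strings, the sweep order, and the name `pick_rubric` are assumptions, since the source does not list them.

```python
# Hypothetical patent triggers; the real list is not given in this document.
PATENT_TRIGGERS = ("patent", "prior art", "35 u.s.c")
MCQ_TRIGGERS = ("(a)", "(b)", "(c)", "(d)", "multiple choice")


def pick_rubric(prompt: str) -> str:
    """Pure function: lowercase the prompt and sweep trigger substrings."""
    lowered = prompt.lower()
    if any(trigger in lowered for trigger in PATENT_TRIGGERS):
        return "patent_claim_validity"
    if any(trigger in lowered for trigger in MCQ_TRIGGERS):
        return "mcq_letter"
    return "generic-correctness"
```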

M5 — CompareRunRecord / CompareResponseRecord / RubricScoreRecord / HumanPrefRecord

Stdlib dataclasses mirroring the spec §4.8 compare_runs / compare_responses / rubric_scores / human_prefs columns. CompareRunRecord defaults publishable=1 — compare runs are the public-facing slice of the cockpit. CompareResponseRecord keys on (compare_run_id, side). RubricScoreRecord carries the JSON-serialized checks_json (one entry per CheckResult); the SQL CHECK constraint enforces at-least-one of compare_run_id / chat_turn_id is set.

| Record | Table | Key columns |
| --- | --- | --- |
| CompareRunRecord | compare_runs | id |
| CompareResponseRecord | compare_responses | (compare_run_id, side) UNIQUE |
| RubricScoreRecord | rubric_scores | autoincrement id; FK back to compare_run_id or chat_turn_id |
| HumanPrefRecord | human_prefs | id |

ArenaStore exposes eight helpers for these: upsert_compare_run / upsert_compare_response / append_rubric_score (returns rowid) / append_human_pref / compare_run(id) / compare_responses(id) / rubric_scores_for_run(id) / human_prefs_for_run(id).

M6 — export_publishable_slice(store, out_dir, *, allow_empty, rebuild, repo_root)

The leak-proof boundary between the operator-private cockpit DB and the public mirror at ainative.business/arena/. Reads ONLY columns listed in PUBLISHABLE_TABLES; the chat_* tables and compare_runs.prompt / compare_responses.content / compare_responses.reasoning are never enumerated by any code path. Writes to <out_dir>/_staging/leaderboard.json first, fully fsync’d, then atomic-renames onto <out_dir>/leaderboard.json per [[reference_sync_workflow_nfs_mount]].

from fieldkit.arena import export_publishable_slice, ArenaStore

store = ArenaStore()
store.initialize()
with store:
    report = export_publishable_slice(store, out_dir="src/data/arena-mirror")
print(report.summary_line())
# → bench=12 live=2 compare_runs=4 rubric_scores=4 human_prefs=2 lanes=50

| Kwarg | Default | What it does |
| --- | --- | --- |
| store | | An open ArenaStore; caller is responsible for .initialize(). |
| out_dir | "src/data/arena-mirror" | Target dir for the JSON files; resolved against repo_root if relative. |
| allow_empty | False | If False, refuse to write a zero-row leaderboard export (guard against blanking the public mirror). |
| rebuild | True | If True, run rebuild_leaderboard as a pre-step. Set False if the caller has already rebuilt. |
| repo_root | None | Override for resolving a relative out_dir (mainly for tests). |

Returns ExportReport (file paths + per-table counts + optional RebuildReport subreport). Raises PublishableSliceEmpty if allow_empty=False and both bench + live row counts would be zero.
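
The stage, fsync, then rename sequence described for the exporter can be sketched with the standard library. This illustrates the general pattern only; `write_atomically` is a hypothetical helper, and the rename is atomic only when the staging directory and the target sit on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_atomically(out_dir: Path, name: str, payload: dict) -> Path:
    staging = out_dir / "_staging"
    staging.mkdir(parents=True, exist_ok=True)
    tmp_path = staging / name
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, sort_keys=True)
        fh.flush()
        os.fsync(fh.fileno())  # force the bytes to disk before the rename
    final_path = out_dir / name
    # os.replace overwrites the target in one step, so readers see either
    # the old file or the new one, never a half-written file.
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as tmp:
    target = write_atomically(Path(tmp), "leaderboard.json", {"rows": []})
    written = json.loads(target.read_text(encoding="utf-8"))
    staged_left_behind = (Path(tmp) / "_staging" / "leaderboard.json").exists()
```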

M6 — rebuild_leaderboard(store)

Recomputes leaderboard_rows from bench_results (one row per (bench_slug, variant_label) with non-null pass-rate) + the live compare_runs × rubric_scores × human_prefs join (one row per (rubric_id, lane_id) aggregated across publishable runs). Live-cockpit rows use bench_id="cockpit:{rubric_id}" so they sort separately from bench-anchored rows. Human-pref winrate is gated at ≥5 prefs per spec §4.4 — under threshold the column is None.

Idempotent — re-running over the same DB produces identical rows. Returns RebuildReport(bench_rows_written, cockpit_rows_written, total_rows). Implicitly run inside export_publishable_slice unless rebuild=False.
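
The ≥5-prefs gate on human-pref winrate can be sketched as a pure function. The gate itself is from the text above; how ties are counted is not stated, so this example assumes ties stay in the denominator, and `human_pref_winrate` as a standalone function is hypothetical.

```python
from typing import Optional, Sequence

MIN_PREFS = 5  # threshold from spec §4.4 as quoted above


def human_pref_winrate(
    verdicts: Sequence[str], side: str, min_prefs: int = MIN_PREFS
) -> Optional[float]:
    """Return the share of verdicts won by `side`, or None below the gate."""
    if len(verdicts) < min_prefs:
        return None  # too few votes to publish a rate
    wins = sum(1 for verdict in verdicts if verdict == side)
    # Assumption: "tie" verdicts count toward the total but not as wins.
    return wins / len(verdicts)
```

Returning None rather than 0.0 below the threshold keeps "not enough data" distinguishable from "never preferred" in the exported column.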

M6 — Allowlist constants

fieldkit.arena.mirror surfaces three load-bearing constants the regression test pins against:

| Constant | Shape | What |
| --- | --- | --- |
| PUBLISHABLE_TABLES | dict[str, tuple[str, ...]] | The hardcoded allowlist. The exporter NEVER reads a column from a table that isn’t a key here, and NEVER reads a column from a publishable table that isn’t in its tuple. compare_runs exposes redacted_prompt but NOT prompt. compare_responses exposes tokens_out / tok_per_s / unified_peak_gb but NOT content / reasoning. |
| FORBIDDEN_TABLES | tuple[str, ...] | ("chat_sessions", "chat_turns", "lab_notes"). Belt over the allowlist’s suspenders: the exporter does not reference these by name; the regression test asserts the table NAMES don’t appear in the emitted JSON either. lab_notes added at v0.2 (operator-private Lab annotations). |
| FORBIDDEN_COLUMNS | tuple[tuple[str, str], ...] | The (table, column) pairs that MUST NOT leak. (compare_runs, prompt), (compare_responses, content), (compare_responses, reasoning), the chat_turns columns, (lab_notes, body), plus (jobs, payload_json) (M8). |
| MIRROR_SCHEMA_VERSION | int | Bumped to 2 for M6 (was 1 at M2). Adds bench_rows / live_rows arrays alongside the legacy rows alias. |
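
An allowlist-driven export and a sentinel-based leak check can be sketched as below. The table shapes are cut down for the demo and the `ALLOWLIST` contents, `export_rows`, and the column set beyond prompt / redacted_prompt are assumptions; the point is that the SELECT is built only from the hardcoded mapping, so a private column or table is never named.

```python
import json
import sqlite3
import uuid

# Hypothetical miniature allowlist: table -> columns that may be exported.
ALLOWLIST = {"compare_runs": ("id", "rubric_id", "redacted_prompt")}


def export_rows(conn, allowlist):
    out = {}
    for table, columns in allowlist.items():
        # Identifiers come from the hardcoded constant, never from user input.
        sql = f"SELECT {', '.join(columns)} FROM {table}"
        rows = conn.execute(sql).fetchall()
        out[table] = [dict(zip(columns, row)) for row in rows]
    return out


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compare_runs (id TEXT, rubric_id TEXT, prompt TEXT, redacted_prompt TEXT)"
)
conn.execute("CREATE TABLE chat_turns (session_id TEXT, ord INTEGER, content TEXT)")

# Plant a random sentinel in every private field, then look for it in the output.
sentinel = uuid.uuid4().hex
conn.execute(
    "INSERT INTO compare_runs VALUES (?, ?, ?, ?)",
    ("r1", "mcq_letter", f"secret {sentinel}", "[redacted]"),
)
conn.execute("INSERT INTO chat_turns VALUES (?, ?, ?)", ("s1", 0, f"private {sentinel}"))
emitted = json.dumps(export_rows(conn, ALLOWLIST))
```

A random sentinel makes the check independent of any particular prompt text: if the string shows up anywhere in the emitted JSON, some code path read a column it should not have.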

M8 — Arena as the control plane

The M8 milestone (_SPECS/spark-arena-v1.md §12) promotes Arena from a recorder into a dispatcher — the place the operator triggers work from. ~/.fieldkit/arena.db gains three operator-private tables (additive + idempotent over the v0.2 schema): jobs (the queue spine) and job_triggers (the audit trail) at PRAGMA user_version = 3, plus leaderboard_baseline (the regression detector’s prev-snapshot store) at user_version = 4. All three are on FORBIDDEN_TABLES; (jobs, payload_json) is on FORBIDDEN_COLUMNS — job payloads carry prompts/lanes/benches and are never mirrored (R13). The dispatcher executes through the fieldkit.harness MCP surface (M8-1) — one execution surface shared with Hermes, so the containment rails are defined once.

M8 — records (fieldkit.arena.schemas)

| Record | Table | Notes |
| --- | --- | --- |
| JobRecord | jobs | The queue row. kind is eval_rerun / measure_variants (M8) or a later-phase stub; status is one of queued/dispatched/running/done/failed/skipped; payload_json is operator-only; dedup_key = (kind, lane_id, bench_id) coalesces in-flight duplicates (R15), None = always-run; arq_job_id is the eval_runs socket (None on the M8 BackgroundTasks path). |
| JobTriggerRecord | job_triggers | What fired a job: a regression delta, a staleness age, or an operator note. id is AUTOINCREMENT (omit on insert). |

M8 — JobKind / JobStatus

| Symbol | Members |
| --- | --- |
| JobKind | EVAL_RERUN, MEASURE_VARIANTS (the DISPATCHABLE set), plus the named-but-not-built stubs REQUANT, RL_RUN, REINDEX, RAG_EVAL, SCOUT_INGEST. DISPATCHABLE / ALL are frozensets. |
| JobStatus | QUEUED, DISPATCHED, RUNNING, DONE, FAILED, SKIPPED; IN_FLIGHT is the dedup-holding subset. |

M8 — enqueue_job(store, kind, payload, *, trigger, priority, dedup_key, trigger_detail, now_fn)

Writes one queued row and returns its id, or None when an in-flight job already holds the dedup_key (the R15 coalesce). Records a job_triggers audit row when trigger_detail is given.

| Kwarg | Default | What it does |
| --- | --- | --- |
| trigger | "manual" | Provenance: manual / leaderboard_regression / stale_bench / … |
| priority | 0 | Higher drains first (regression confirmations enqueue at 1). |
| dedup_key | (kind, lane_id, bench_id) from payload | Pass "" to force an always-run job, or a custom key. None-resolving keys never coalesce. |
| trigger_detail | None | When set, also writes the job_triggers audit row (the regression delta / staleness age / operator note). |
| now_fn | UTC ISO stamp | Injectable clock (deterministic tests). |
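
One way to get the coalescing behaviour is a partial unique index over in-flight rows, sketched here with in-memory SQLite. The cut-down `jobs` table, the index definition, the pipe-joined key format, and the `enqueue` helper are all assumptions for illustration; the source only states that a dedup unique index coalesces duplicates and that None keys always run.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (
        id TEXT PRIMARY KEY,
        kind TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued',
        dedup_key TEXT
    );
    -- Uniqueness applies only while a job is in flight.
    CREATE UNIQUE INDEX jobs_inflight_dedup ON jobs (dedup_key)
        WHERE status IN ('queued', 'dispatched', 'running');
    """
)


def enqueue(conn, kind, lane_id, bench_id, always_run=False):
    # NULL keys never collide in a unique index, which gives "always-run".
    dedup_key = None if always_run else f"{kind}|{lane_id}|{bench_id}"
    job_id = uuid.uuid4().hex
    try:
        conn.execute(
            "INSERT INTO jobs (id, kind, dedup_key) VALUES (?, ?, ?)",
            (job_id, kind, dedup_key),
        )
    except sqlite3.IntegrityError:
        return None  # an in-flight job already holds this key
    return job_id


first = enqueue(conn, "eval_rerun", "lane-1", "bench-a")
duplicate = enqueue(conn, "eval_rerun", "lane-1", "bench-a")
conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (first,))
after_done = enqueue(conn, "eval_rerun", "lane-1", "bench-a")
forced_one = enqueue(conn, "eval_rerun", "lane-1", "bench-a", always_run=True)
forced_two = enqueue(conn, "eval_rerun", "lane-1", "bench-a", always_run=True)
```

Because the index is partial, finishing a job releases its key automatically; no separate cleanup step is needed before the same (kind, lane, bench) can be queued again.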

M8 — dispatch_job(store, job, *, runner, now_fn) / drain_jobs(store, *, runner, max_jobs, now_fn, on_error)

dispatch_job runs one claimed job end-to-end: running → execute via runner (default default_runner, which calls the harness MCP tools — run_vertical_eval / measure_variants) → done (persisting an eval_rerun through the existing eval_scores scorer path + activating the eval_runs status row) or failed (stamping jobs.error + raising JobDispatchError). runner is injectable so tests dispatch without a GPU. drain_jobs claims the oldest queued job and dispatches it in a loop until empty (M8-5, sequential single-lane); max_jobs caps a pass, on_error ("record" default / "raise") controls whether a failed job halts the drain.
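
The drain loop with an injectable runner can be sketched over a plain list of dicts. This is an in-memory illustration, not the SQLite-backed implementation: `drain` and `fake_runner` are hypothetical, the ordering rule (higher priority first, then oldest) is an assumption combining the two statements in this section, and the intermediate dispatched state is omitted.

```python
def drain(jobs, runner, max_jobs=None, on_error="record"):
    """Dispatch queued jobs one at a time until none remain or max_jobs is hit."""
    handled = 0
    while max_jobs is None or handled < max_jobs:
        queued = [job for job in jobs if job["status"] == "queued"]
        if not queued:
            break
        # Assumed ordering: higher priority first, then oldest first.
        job = min(queued, key=lambda j: (-j["priority"], j["created"]))
        job["status"] = "running"
        handled += 1
        try:
            job["result"] = runner(job)
        except Exception as exc:
            job["status"] = "failed"
            job["error"] = str(exc)  # stamp the message on the row
            if on_error == "raise":
                raise
        else:
            job["status"] = "done"
    return handled


order = []


def fake_runner(job):
    # Stand-in for the real runner, so the loop can be exercised without a GPU.
    order.append(job["id"])
    if job["id"] == "c":
        raise ValueError("bench missing")
    return {"ok": True}


jobs = [
    {"id": "a", "priority": 0, "created": 1, "status": "queued"},
    {"id": "b", "priority": 1, "created": 2, "status": "queued"},
    {"id": "c", "priority": 0, "created": 3, "status": "queued"},
]
handled = drain(jobs, fake_runner)
statuses = {job["id"]: job["status"] for job in jobs}
```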

M8 — detect_leaderboard_regression(prev, curr, *, tau) / enqueue_regressions(store, prev, curr, *, tau, now_fn)

detect_leaderboard_regression is the pure, testable core: diff two ArenaStore.eval_leaderboard() accuracy-rollup snapshots and return one {bench_id, lane_id, prev_score, new_score, delta} per (bench, lane) whose mean_normalized dropped by more than tau (default 0.05), worst-drop first. Newly-seen lanes can’t regress. enqueue_regressions runs the detector and enqueues a confirming eval_rerun (priority 1, leaderboard_regression trigger) per regression — coalescing duplicates while one is in flight.
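
A pure diff of two leaderboard snapshots might be written as follows. The row shape (dicts with bench_id, lane_id, mean_normalized) and the sign convention (delta = new score minus previous score, so drops are negative) are assumptions; `detect_regressions` is a hypothetical stand-in for the real detector.

```python
def detect_regressions(prev, curr, tau=0.05):
    """Return one record per (bench, lane) whose score dropped by more than tau."""
    prev_by_key = {
        (row["bench_id"], row["lane_id"]): row["mean_normalized"] for row in prev
    }
    found = []
    for row in curr:
        key = (row["bench_id"], row["lane_id"])
        if key not in prev_by_key:
            continue  # newly seen lanes have nothing to regress from
        delta = row["mean_normalized"] - prev_by_key[key]
        if -delta > tau:
            found.append({
                "bench_id": key[0],
                "lane_id": key[1],
                "prev_score": prev_by_key[key],
                "new_score": row["mean_normalized"],
                "delta": delta,
            })
    found.sort(key=lambda rec: rec["delta"])  # most negative (worst drop) first
    return found


prev = [
    {"bench_id": "bench-a", "lane_id": "lane-1", "mean_normalized": 0.80},
    {"bench_id": "bench-a", "lane_id": "lane-2", "mean_normalized": 0.90},
    {"bench_id": "bench-b", "lane_id": "lane-1", "mean_normalized": 0.50},
]
curr = [
    {"bench_id": "bench-a", "lane_id": "lane-1", "mean_normalized": 0.70},
    {"bench_id": "bench-a", "lane_id": "lane-2", "mean_normalized": 0.88},
    {"bench_id": "bench-b", "lane_id": "lane-1", "mean_normalized": 0.20},
    {"bench_id": "bench-b", "lane_id": "lane-3", "mean_normalized": 0.10},
]
regressions = detect_regressions(prev, curr)
```

Keeping the detector free of I/O means the enqueue side can be tested separately with hand-built snapshots like these.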

M8 — check_and_enqueue_regressions(store, *, tau, now_fn)

The wired regression producer (M8-2) — the link between the pure detector and the running cockpit. Diffs the live eval_leaderboard() against the stored leaderboard_baseline, enqueues a confirming eval_rerun per over-tau drop (R15 dedup applies), then overwrites the baseline with the current snapshot. The first scan only sets the baseline (nothing to diff against → no enqueues), so a fresh box never storms. Returns {checked, baselined, had_baseline, enqueued: [job_id, …], regressions: [delta, …]}. Operator-triggered via POST /api/jobs/check-regressions (a Jobs-page button); the Phase-2 cron calls the same path on a schedule.

M8 — resolve_bench(bench_id, *, bench_dir) / DEFAULT_BENCH_DIR

Resolves a bench_id → {bench_path, scorer, max_tokens, limit} from the bench registry: a directory ($ARENA_BENCH_DIR or DEFAULT_BENCH_DIR = ~/.fieldkit/arena/benches) holding one <bench_id>.jsonl gold set per bench, with an optional <bench_id>.meta.json sidecar overriding the scorer (default exact_match) and the eval knobs. Returns None when no gold set is registered. default_runner calls this to fill an eval_rerun’s bench_path when the job payload (a regression trigger, the UI dispatch form) carries only a bench_id; an unresolvable bench raises BenchNotRegistered naming the exact path searched, rather than failing opaquely deep in the eval tool. An explicit payload bench_path still wins.
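
The directory-registry lookup can be sketched like this. It follows the file layout described above, but `resolve_bench_sketch` is a hypothetical function, and using None as the default for max_tokens and limit is an assumption because the source does not state those defaults.

```python
import json
import tempfile
from pathlib import Path


def resolve_bench_sketch(bench_id: str, bench_dir):
    gold = Path(bench_dir) / f"{bench_id}.jsonl"
    if not gold.is_file():
        return None  # no gold set registered for this bench
    resolved = {
        "bench_path": str(gold),
        "scorer": "exact_match",  # documented default scorer
        "max_tokens": None,       # assumed placeholder default
        "limit": None,            # assumed placeholder default
    }
    meta = Path(bench_dir) / f"{bench_id}.meta.json"
    if meta.is_file():
        overrides = json.loads(meta.read_text(encoding="utf-8"))
        for key in ("scorer", "max_tokens", "limit"):
            if key in overrides:
                resolved[key] = overrides[key]
    return resolved


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.jsonl").write_text('{"q": "1+1", "a": "2"}\n', encoding="utf-8")
    (Path(tmp) / "demo.meta.json").write_text('{"scorer": "contains", "limit": 10}', encoding="utf-8")
    found = resolve_bench_sketch("demo", tmp)
    missing = resolve_bench_sketch("absent", tmp)
```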

M8 — errors

| Error | Raised when |
| --- | --- |
| JobDispatchError | A job failed mid-execution; the row is already marked failed with the message in jobs.error. |
| UnknownJobKind | An enqueue named a kind outside JobKind.ALL, or a dispatch named a stub outside JobKind.DISPATCHABLE. |
| BenchNotRegistered | An eval_rerun named a bench_id with no resolvable gold JSONL (no payload bench_path, no registered <bench_id>.jsonl). The message names the path searched. |

M8 — sidecar endpoints (/api/jobs)

| Method | Path | Notes |
| --- | --- | --- |
| GET | /api/jobs?status=&limit= | The board read: newest first, optional status filter. Empty (not 404) on a fresh box. |
| POST | /api/jobs | Enqueue {kind, payload, trigger, priority, dispatch}. dispatch=True (default) drains the queue in a BackgroundTask (the M8 primary single-lane path, R14; no arq/Redis). Returns coalesced=True when the dedup gate fires. |
| POST | /api/jobs/check-regressions?tau=&dispatch= | Scan the live leaderboard vs the baseline, enqueue a leaderboard_regression eval_rerun per over-tau drop, re-baseline. First scan only sets the baseline. Returns {checked, had_baseline, enqueued, regressions}. Declared before {job_id}. |
| GET | /api/jobs/stream | SSE: emits a full board snapshot on connect + on change (declared before {job_id} so it isn’t captured as an id). |
| GET | /api/jobs/{job_id} | One job + its trigger trail; 404 if unknown. |
| DELETE | /api/jobs/{job_id} | Cancel a not-yet-running job (→ skipped); 409 if running/done, 404 if unknown. |

M8 — store methods (ArenaStore)

| Method | Returns | What |
| --- | --- | --- |
| .enqueue_job(row) | str \| None | Strict INSERT; None when the dedup unique-index coalesces. |
| .record_job_trigger(row) | int | Append a job_triggers audit row. |
| .claim_next_job(*, dispatched_at) | sqlite3.Row \| None | Atomically flip the oldest queued job to dispatched. |
| .update_job(job_id, **fields) / .get_job(id) / .list_jobs(*, status, limit) / .cancel_job(id) | | Patch / read / board-list / cancel. |
| .upsert_eval_run(row) / .update_eval_run(id, **fields) / .get_eval_run(id) | | The per-run status row M8 activates (the arq_job_id socket). |
| .leaderboard_baseline() / .snapshot_leaderboard_baseline(rows, *, now) | list[Row] / int | Read / full-overwrite the regression baseline (one (bench, lane) accuracy row each) that check_and_enqueue_regressions diffs against. |

M9 — cost plane (Bet 6)

The third ranking axis — token economics promoted to a first-class signal (_SPECS/spark-arena-v1.md §13). The cost the compare/chat path already computes (_compare_cost_usd) is now persisted and surfaced: per-run rows feed an aggregate $/quality-point on the public leaderboard, and the live spend rail survives a sidecar restart. The full API lives in its own module — fieldkit.cost (CostLedger, PriceSnapshot, seed_price_snapshot, cost_per_quality) — because it spans the new openrouter_price_snapshot table, not just fieldkit.arena. It is a ledger, not a governor: enforcement (fieldkit.budget) is Phase 2 (Arena M11, §15).

What changed inside fieldkit.arena:

  • Schema user_version 4 → 5 — the first ALTER-based migration (ArenaStore._migrate / _add_column_if_missing, R18). Adds the per-run cost columns to chat_turns / compare_responses, the aggregate mean_cost_usd / cost_per_quality_point to leaderboard_rows, and the new openrouter_price_snapshot table (seeded at initialize() from the baked H6 evidence via fieldkit.cost.seed_price_snapshot).
  • server.py: the compare _emit_side + chat completion paths INSERT cost_usd / tokens_in / tokens_estimated / price_snapshot_id onto the response row at the point they call add_openrouter_cost; TelemetryHub.seed_session_spend rehydrates the live rail from CostLedger.session_spend() at create_app (M9-8). Local lanes write 0.0.
  • mirror.py: rebuild_leaderboard computes mean_cost_usd (AVG over the bench×lane runs) + cost_per_quality_point (mean_cost_usd / mean_score, guard >0). openrouter_price_snapshot joins PUBLISHABLE_TABLES (public, no prompts), the two aggregate cost columns join the leaderboard_rows allowlist, and the per-run cost columns inherit their host tables’ exclusion (M9-7, anchored by test_mirror_does_not_leak.py).
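
The two mechanics in this list, an idempotent ALTER-based column add and the cost-per-quality-point aggregate, can be sketched together with in-memory SQLite. Both helpers are hypothetical re-creations written from the description (not ArenaStore._add_column_if_missing itself), the demo table is cut down, and reading "guard >0" as a guard on mean_score is an assumption.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # re-running the migration is a no-op
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


def cost_per_quality_point(mean_cost_usd, mean_score):
    # Guard against division by zero or a missing aggregate.
    if mean_cost_usd is None or mean_score is None or mean_score <= 0:
        return None
    return mean_cost_usd / mean_score


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compare_responses (compare_run_id TEXT, side TEXT)")
added_first = add_column_if_missing(conn, "compare_responses", "cost_usd", "REAL")
added_again = add_column_if_missing(conn, "compare_responses", "cost_usd", "REAL")
conn.execute("PRAGMA user_version = 5")  # record the schema version after migrating
conn.executemany(
    "INSERT INTO compare_responses VALUES (?, ?, ?)",
    [("r1", "B", 0.002), ("r2", "B", 0.004)],
)
mean_cost = conn.execute("SELECT AVG(cost_usd) FROM compare_responses").fetchone()[0]
schema_version = conn.execute("PRAGMA user_version").fetchone()[0]
```

Checking the live column list before each ALTER is what lets the migration run on every initialize() without failing on a database that was already upgraded.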

M10 — recall layer (Bet 5)

The Second Brain promoted from a manual, prose-only, externally-scripted index into a managed, multi-source, evaluated, provenance-tagged one the operator drives from the cockpit (_SPECS/spark-arena-v1.md §14). The full ingest / query / coverage API lives in its own module — fieldkit.memory (MemoryIndex, KnowledgeCard, Provenance, ingest_sources, coverage_report, resolve_qa_set) — because it spans pgvector blog_chunks, not just fieldkit.arena. It ships the operator-driven pane + managed index; the autonomous re-index-on-publish hook + scheduled freshness monitor are Phase 2 (Arena M11, §15), which consumes this pane’s re-index button + eval gate.

What changed inside fieldkit.arena:

  • Schema user_version 5 → 6 — additive CREATE TABLE IF NOT EXISTS for reindex_runs (per-rebuild provenance — operator-private) and rag_eval_runs (eval scores per index version — public-safe aggregates). The pgvector provenance ALTER lives in fieldkit.memory.MemoryIndex.ensure_schema (R21), not the arena store. Store readers/writers: insert_reindex_run / update_reindex_run / reindex_runs and insert_rag_eval_run / rag_eval_runs / last_rag_eval (the promotion-gate baseline).
  • jobs.py: reindex / rag_eval / scout_ingest promoted from JobKind named stubs into JobKind.DISPATCHABLE (M10-1, the move M8 made for eval_rerun). default_runner dispatches each through the fieldkit.harness MCP surface (reindex_memory / rag_eval_index / scout_ingest); _persist_reindex writes a reindex_runs row, _persist_rag_eval writes a rag_eval_runs row and applies the promotion gate (M10-6: a recall-dropping rebuild is flagged promote=False, like-for-like per R22).
  • server.py — the /api/knowledge pane API: a degraded-safe coverage + trend + run-history snapshot (GET), POST /api/knowledge/reindex (+ chained rag_eval), POST /api/knowledge/rag-eval, and the operator-private POST /api/knowledge/query (provenance-filtered chunk text — 503 when the live index is unreachable). The jobs-board kind pattern widens to accept the three new dispatchable kinds.
  • mirror.py: rag_eval_runs aggregates join PUBLISHABLE_TABLES for the public RAG-eval trend (no prompts, no chunk text); reindex_runs joins FORBIDDEN_TABLES (its source_set can name internal slugs). A knowledge-path sentinel anchors test_mirror_does_not_leak.py (M10-10).
  • Cockpit — a new /arena/knowledge/ pane: coverage/freshness (the article_index ⋈ index diff, M10-8), a per-source-class Re-index button, the RAG-eval trend (cosine-only labelled, M10-7), and the trust-tier query console.
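The promotion gate from the jobs.py bullet above can be pictured as a comparison against the last eval baseline. The sketch below is a minimal illustration under stated assumptions: EvalRun, its fields (qa_set, recall), the tolerance parameter, and raising ValueError on a comparison that is not like-for-like are all hypothetical choices, not the behaviour of _persist_rag_eval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRun:
    """Hypothetical stand-in for one rag_eval_runs row."""
    index_version: str
    qa_set: str      # assumed: identifies which QA set was scored
    recall: float    # assumed: aggregate recall for that index version


def promotion_gate(candidate: EvalRun, baseline: Optional[EvalRun], tolerance: float = 0.0) -> bool:
    """Return True when the rebuilt index may be promoted."""
    if baseline is None:
        # First eval on a fresh box: nothing to regress against.
        return True
    if candidate.qa_set != baseline.qa_set:
        # Assumed handling: refuse to compare scores from different QA sets.
        raise ValueError("not like-for-like: QA sets differ")
    # A rebuild that drops recall below the baseline is flagged promote=False.
    return candidate.recall >= baseline.recall - tolerance


baseline = EvalRun("v1", "qa-core", 0.80)
improved = promotion_gate(EvalRun("v2", "qa-core", 0.85), baseline)
regressed = promotion_gate(EvalRun("v2", "qa-core", 0.70), baseline)
```

Keeping the gate a pure function of two rows makes it easy to test without a live index.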

M11 — autonomous harness + cron (Phase 2)

The hands in the pane → hands → engine sequence (_SPECS/spark-arena-v1.md §15): the missing trigger that turns M8’s button-driven dispatcher into a self-operating overnight loop with a human-review gate. M11 reimplements no dispatch — it schedules the already-built drain_jobs() + check_and_enqueue_regressions(), gated by the new fieldkit.budget governor (BudgetGovernor, BudgetDecision, SpendDigest, EscalationReason, MemoryEnvelope, check_budget) — a sibling top-level module, because the governor spans more than fieldkit.arena. No schema, no user_version bump (AH-9) — the schema stays at M10’s 6; schedules live in version-controlled config, the standup is an ephemeral render.

What changed inside fieldkit.arena:

  • scheduler.py (new) — the cron glue (AH-1). run_drain_cycle(store, *, governor=None, …) is one tick: acquire the one-drain-at-a-time DrainLock (the scheduled_tasks.lock pattern with stale-pid stealing — never stacks a second GPU lane, R24), drain_jobs with the governor in the loop, the check_and_enqueue_regressions freshness sweep (AH-6 — emits the next tick’s triggers), then build_standup. Returns {skipped, drained, sweep, standup}; no push path exists by construction (R26). build_standup(store, *, governor, sweep, cap_usd) is the AH-3 render — Ran / Regressed / Queued / Spend over the existing jobs / leaderboard_baseline / M9 cost rows, aggregate + operator-private (it projects id/kind/status, never payload_json).
  • jobs.py: drain_jobs gains an optional governor (duck-typed: anything with .check_budget(job) -> BudgetDecision). Each claimed job is checked before dispatch: an allow dispatches; an escalate / defer releases the claim back to queued, records a budget_<action> audit row in job_triggers, and stops the pass (the budget brake). The drain never escalates or pushes itself; it stages the decision (AH-3/AH-8).
  • server.py: GET /api/standup renders the standup snapshot (the cost ledger is read via a BudgetGovernor(ledger=store); the Spend row degrades to “—” pre-M9, AH-5). Read-only: it never drains (an HTTP GET never launches a GPU lane; the cron owns dispatch). Empty (not 404) on a fresh box.
  • Cockpit — a new /arena/standup/ pane (the morning-review gate): the Spend rail + the Ran / Regressed / Failed / Queued buckets, stage-only (“the loop has no push path”).
  • Hook battery (.claude/) — the lone SessionStart hook expands into a battery (AH-2, deterministic shell only, invariant #4): pre_commit_guard.sh (PreToolUse — secret-scan hard-blocks a planted secret, the render verifiers run advisory per R25), post_publish.sh (PostToolUse — stats nudge + freshness-trigger enqueue on an articles/products commit), and stop_feedback.sh (the §6.5 Stop loop, finally wired — nudges on uncommitted artifact work).
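The budget brake described in the jobs.py bullet above can be sketched as a loop over claimed jobs. Everything here is a simplified stand-in: the BudgetDecision fields, the in-memory job dicts, the DrainResult shape, and the CapGovernor policy are assumptions for illustration, not the fieldkit.budget or drain_jobs implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Protocol, Tuple


@dataclass
class BudgetDecision:
    action: str          # assumed values: "allow" | "escalate" | "defer"
    reason: str = ""


class Governor(Protocol):
    def check_budget(self, job: Dict) -> BudgetDecision: ...


@dataclass
class DrainResult:
    dispatched: List[int] = field(default_factory=list)
    audit: List[Tuple[int, str]] = field(default_factory=list)


def drain(queue: List[Dict], dispatch: Callable[[Dict], None],
          governor: Optional[Governor] = None) -> DrainResult:
    result = DrainResult()
    for job in queue:
        if job["status"] != "queued":
            continue
        job["status"] = "claimed"
        if governor is not None:
            decision = governor.check_budget(job)
            if decision.action != "allow":
                job["status"] = "queued"  # release the claim
                result.audit.append((job["id"], f"budget_{decision.action}"))
                break                     # the budget brake: stop the pass
        dispatch(job)
        job["status"] = "done"
        result.dispatched.append(job["id"])
    return result


class CapGovernor:
    """Hypothetical governor: defers once a spend cap would be exceeded."""
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd, self.spent = cap_usd, 0.0

    def check_budget(self, job: Dict) -> BudgetDecision:
        if self.spent + job["cost_usd"] > self.cap_usd:
            return BudgetDecision("defer", "cap reached")
        self.spent += job["cost_usd"]
        return BudgetDecision("allow")


jobs = [{"id": i, "status": "queued", "cost_usd": 1.0} for i in (1, 2, 3)]
outcome = drain(jobs, dispatch=lambda job: None, governor=CapGovernor(cap_usd=2.0))
```

The loop only stages the decision: the deferred job goes back to queued and the audit entry is left for a human to review.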

RL-lane autonomy (lane.py, rl-lane-autonomy v1 — LA-1..11)

The self-driving layer for the Phase-3 engine: the connective tissue that turns a dispatchable rl_run (RV-6) into a run that is self-driving, observable, and self-defending, without re-implementing any GPU physics. import fieldkit.arena.lane stays stdlib-cheap (torch/vLLM only enter inside the lane factory). No schema change (LA-7: user_version stays 6); no new top-level module (it is documented here, under arena).

  • LaneArbiter (LA-1/2/6): the envelope-gated single serving slot, a context manager the GPU-kind runner enters. __enter__ runs the 3-way pre-flight (governor allow ∧ MemoryEnvelope.fits ∧ a vLLM binary present; any failure raises LaneDeferred before anything is torn down), frees the resident chat brain (stop_resident), and starts the MemoryWatchdog. __exit__ stops the watchdog, tears down the vLLM lane (VLLMLane.stop, EngineCore-aware: its process-pattern pkill reaps the seam-started server too), and always restores the prior lane (R1: never leave the box with no serving lane). It composes inside the M11 DrainLock, never replaces it (LA-2).
  • MemoryWatchdog (LA-10, arena-wide) — enforces a unified-memory headroom floor off the same /proc/meminfo source TelemetryHub samples. Warns below FK_RL_OOM_WARN_GB (8); on a breach that persists persist_n samples (~2 s — the R6 anti-transient guard) it touches an abort sentinel the loop polls between steps and records the trip on the trace. It never trips on a missing sample (R7). Reusable by every GPU kind.
  • mem_trace / MemTrace (LA-11) — the per-run memory recorder (peak, headroom-at-spawn, per-phase deltas, abort sample). Thread-safe; rides jobs.result_json + the standup (“RAN 1 · peak 119 GB · 1 OOM-deferred”).
  • RLLaneContext — the one optional object dispatch consults for an rl_run. dispatch_job(store, job, *, rl_lane=…) and drain_jobs(store, *, rl_lane=…) take it; when wired and the kind is rl_run the run is arbitered (pre-flight → resident-brain teardown → watchdog → live progress → mem-trace) and a failed pre-flight releases the claim back to queued + audits (budget_<action>, never fails); when None (the M8 default) every kind runs bare, byte-for-byte RV-6 behavior. Defaults read FK_RL_OOM_* + FK_RL_RESIDENT_{STOP,START}_CMD.
  • Live progress (LA-8): rl_progress_writer(store, job_id, …) builds the throttled single-writer callback the loop pushes {step, phase, pool_score, last_heldout, eta_s, mem} through (a write per phase-change/held-out-gate, else ≤ once per throttle_s). _jobs_signature gains a progress nonce so the /api/jobs/stream board re-emits while a run is running.
  • Async-enqueue (LA-4): POST /api/jobs now accepts rl_run but forces dispatch=False (RV-6): the 8.5 h loop never runs in a request’s BackgroundTask. The response carries async_only: true + an autonomy note.
  • Autonomy CLI (LA-5): fieldkit arena autonomy on|off|status writes the reversible policy record (fieldkit.arena.scheduler.read_autonomy_state) and prints/installs the crontab line; fieldkit arena drain is the cron target (one run_drain_cycle tick). The standup surfaces the armed state + the RL memory digest. The external blocker is unchanged (a pinned aarch64+CUDA-13 vLLM); absent it the arbiter defers cleanly (LANE_BIN_ABSENT), so the whole surface ships + is GPU-free-testable now. See docs/api/rl.md → “Operator: full autonomy” + _SPECS/rl-lane-autonomy-v1.md.
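The MemoryWatchdog rules above (warn below a threshold, trip only on a persistent breach, never trip on a missing sample) reduce to a small state machine. The sketch below is an assumption-laden illustration: the class name HeadroomWatchdog, the floor_gb parameter and its value, and leaving the breach counter untouched on a missing sample are hypothetical; only the 8 GB warn default and the persist_n idea come from the text. Touching the abort sentinel is left out.

```python
from typing import Optional


class HeadroomWatchdog:
    """Trip only after `persist_n` consecutive samples below the floor."""

    def __init__(self, floor_gb: float, warn_gb: float = 8.0, persist_n: int = 3) -> None:
        self.floor_gb = floor_gb      # hypothetical hard floor
        self.warn_gb = warn_gb        # mirrors the FK_RL_OOM_WARN_GB default
        self.persist_n = persist_n
        self.warnings = 0
        self.tripped = False
        self._breaches = 0

    def observe(self, available_gb: Optional[float]) -> bool:
        if available_gb is None:
            # Missing sample: assumed to be neither a breach nor a recovery.
            return self.tripped
        if available_gb < self.warn_gb:
            self.warnings += 1
        if available_gb < self.floor_gb:
            self._breaches += 1
        else:
            self._breaches = 0        # a transient dip has recovered
        if self._breaches >= self.persist_n:
            self.tripped = True       # the real watchdog would touch the sentinel here
        return self.tripped


watchdog = HeadroomWatchdog(floor_gb=4.0, persist_n=3)
samples = [3.0, 3.0, 10.0, 3.0, None, 3.0, 3.0]
history = [watchdog.observe(sample) for sample in samples]
```

The recovery at the third sample resets the counter, so the trip only happens once three breaches land in a row.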

v0.2 surfaces (Lab + distribution)

v0.2 — Lab notes (lab_notes table + /api/lab/notes)

Operator-private annotations pinned to a Lab board card, powering /arena/lab/’s <LabNotes> island. Deterministic CRUD only — no LLM generation (feedback_llm_skill_pattern). The lab_notes table is on FORBIDDEN_TABLES + pinned by test_mirror_does_not_leak.py, so the freeform body is never mirrored.

ArenaStore methods: append_lab_note(row) -> int (append-only insert; caller stamps created_at), lab_notes(card_id=None, limit=200) -> list[Row] (newest first, optionally scoped to one card; rows carry body — loopback-only reads, same stance as the chat-replay endpoint), delete_lab_note(note_id) -> bool.

| Method | Endpoint | Body / params | Returns |
| --- | --- | --- | --- |
| GET | /api/lab/notes?card_id=&limit= | optional card_id scope | {notes: [{id, card_id, lane, body, created_at, updated_at}]} (empty list on cold DB, never 500) |
| POST | /api/lab/notes | LabNoteRequest{card_id, body, lane?} | {ok, note_id, card_id, n_notes} |
| DELETE | /api/lab/notes/{note_id} | | {ok, note_id}; 404 if absent |
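The three store methods behind these endpoints can be approximated with the standard library. This is a sketch under assumptions: the column names are taken from the response shape above, but the column types, the AUTOINCREMENT key, and the free functions taking a connection (instead of ArenaStore methods) are illustrative choices.

```python
import sqlite3
from typing import Dict, List, Optional


def open_notes_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lab_notes ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " card_id TEXT NOT NULL, lane TEXT, body TEXT NOT NULL,"
        " created_at TEXT NOT NULL, updated_at TEXT)"
    )
    return conn


def append_lab_note(conn: sqlite3.Connection, row: Dict) -> int:
    # Append-only insert; the caller stamps created_at.
    cur = conn.execute(
        "INSERT INTO lab_notes (card_id, lane, body, created_at) VALUES (?, ?, ?, ?)",
        (row["card_id"], row.get("lane"), row["body"], row["created_at"]),
    )
    conn.commit()
    return cur.lastrowid


def lab_notes(conn: sqlite3.Connection, card_id: Optional[str] = None,
              limit: int = 200) -> List[sqlite3.Row]:
    # Newest first, optionally scoped to one card.
    if card_id is None:
        sql, args = "SELECT * FROM lab_notes ORDER BY created_at DESC, id DESC LIMIT ?", (limit,)
    else:
        sql = ("SELECT * FROM lab_notes WHERE card_id = ?"
               " ORDER BY created_at DESC, id DESC LIMIT ?")
        args = (card_id, limit)
    return conn.execute(sql, args).fetchall()


def delete_lab_note(conn: sqlite3.Connection, note_id: int) -> bool:
    cur = conn.execute("DELETE FROM lab_notes WHERE id = ?", (note_id,))
    conn.commit()
    return cur.rowcount > 0


conn = open_notes_db()
first = append_lab_note(conn, {"card_id": "c1", "body": "try q4", "created_at": "2025-01-01T00:00:00Z"})
second = append_lab_note(conn, {"card_id": "c1", "body": "q4 regressed", "created_at": "2025-01-02T00:00:00Z"})
```

A delete that matches no row returns False, which is what the endpoint would translate into a 404.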

v0.2 — packaged web UI (fieldkit arena build / up)

Arena’s primary distribution surface is the fieldkit PyPI wheel. fieldkit.arena.webui.build_webui(repo_root, *, dest, skip_astro, demo) runs the Astro build (base: '/arena') and prunes the routed pages + shared assets (raster images dropped) into a self-contained bundle. Two modes:

  • wheel (default): ARENA_BUILD=1 → packaged fieldkit/src/fieldkit/arena/_webui/ (declared in pyproject.toml’s hatch include); served by the sidecar’s StaticFiles mount.
  • demo (demo=True): ARENA_DEMO=1 → <repo_root>/dist-arena-demo-pruned/ for the sidecar-less public web preview (GitHub Pages). The prune promotes arena/* to the bundle root (so /arena/ is the cockpit and the absolute single-/arena/ nav hrefs resolve), additionally copies the demo-only arena-demo/ dir (the fetch/EventSource shim + recorded fixtures.json), and writes a .nojekyll marker (GitHub Pages’ Jekyll would otherwise strip assets/_slug_*.css). Deploy = copy the bundle’s contents into the publisher’s public/arena/.
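The three demo-mode prune steps can be sketched with pathlib and shutil. This is a rough illustration, not build_webui: the function name prune_demo, the assumption that arena-demo/ sits next to arena/ in the Astro output and lands at the bundle root, and the omission of shared-asset handling and raster-image dropping are all simplifications.

```python
import shutil
from pathlib import Path


def prune_demo(dist: Path, dest: Path) -> Path:
    """Promote dist/arena/* to the bundle root, copy arena-demo/, write .nojekyll."""
    dest.mkdir(parents=True, exist_ok=True)
    # Step 1: promote the routed cockpit pages to the bundle root.
    shutil.copytree(dist / "arena", dest, dirs_exist_ok=True)
    # Step 2: copy the demo-only shim + fixtures directory (assumed location).
    demo_dir = dist / "arena-demo"
    if demo_dir.is_dir():
        shutil.copytree(demo_dir, dest / "arena-demo", dirs_exist_ok=True)
    # Step 3: the marker that stops GitHub Pages from running Jekyll.
    (dest / ".nojekyll").touch()
    return dest
```

Calling prune_demo(Path("dist"), Path("dist-arena-demo-pruned")) would leave index.html at the bundle root, so copying the bundle into public/arena/ serves the cockpit at /arena/.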

webui_dir() / bundle_present() locate the wheel bake. create_app() mounts it via a StaticFiles mount at /arena (_mount_packaged_webui, guarded — a missing bundle degrades to API-only mode). Served from the sidecar → page origin == sidecar origin == same-origin, so the islands’ resolveSidecarUrl() resolves to their own origin and CORS is dev-only.

| CLI | What |
| --- | --- |
| fieldkit arena build [--repo-root …] [--skip-astro] [--demo] | Builder-side only: bake the bundle (shells out to node node_modules/astro/astro.js build). Default bakes the wheel bundle (run at release time); --demo bakes the GitHub Pages preview into dist-arena-demo-pruned/. |
| fieldkit arena up [--host --port --db --open/--no-open] | The one-command UX: serve the cockpit and open a browser tab. pip install fieldkit[arena] → fieldkit arena up → http://127.0.0.1:7866/arena/. |

M2 — ArenaStore

Synchronous SQLite store at ~/.fieldkit/arena.db. Used by the M2 importer + the future M6 mirror exporter; the M3 FastAPI sidecar opens a parallel async connection via aiosqlite against the same database file (SQLite handles concurrency via WAL).

| Method | Returns | Notes |
| --- | --- | --- |
| ArenaStore(db_path=None) | | Path defaults to ~/.fieldkit/arena.db (operator-private, gitignored). |
| .initialize() | None | Creates the 13-table schema + indexes (idempotent: every DDL is CREATE TABLE IF NOT EXISTS); pins PRAGMA user_version=1 for forward migration. |
| .connect() | sqlite3.Connection | Opens lazily; enables WAL + foreign keys; returns the conn for raw SQL access. |
| .close() | None | Commits + closes. Also called by the with block on exit. |
| .transaction() | Iterator[sqlite3.Connection] | Batch helper: commits on success, rolls back on exception. |
| .initialize.user_version | int | Read-only: current PRAGMA user_version. |
| .table_names() | list[str] | Sorted; introspection for tests + curator audit. |
| .count(table) | int | Validates table name against sqlite_master; returns 0 for unknown tables. |
| .upsert_lane(row) | None | INSERT OR REPLACE INTO lanes …; row may be a mapping or a LaneRecord. |
| .upsert_bench_result(row) | None | Same shape, keyed on (bench_slug, variant_label). |
| .upsert_article(row) | None | Keyed on slug. |
| .upsert_hf_meta(row) | None | Keyed on repo_id. |
| .upsert_notebook_export(row) | None | Keyed on file_path. |
| .upsert_leaderboard_row(row) | None | Keyed on (bench_id, lane_id). |
| .lanes() / .articles() / .bench_results(slug=None) / .leaderboard_rows() | list[sqlite3.Row] | Read helpers the importer + future mirror exporter both call. |

The store also exposes DEFAULT_DB_PATH (os.path.expanduser’d at module load) and USER_VERSION (the schema version pin).
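The store behaviours in the table (idempotent DDL, the user_version pin, the commit-or-rollback batch helper, the validated count) can be sketched with sqlite3 alone. MiniStore below is a hypothetical miniature, not ArenaStore: it has one made-up table, opens eagerly, and exposes user_version as a plain property for simplicity.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator

USER_VERSION = 1  # mirrors the M2 pin described above


class MiniStore:
    def __init__(self, db_path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(db_path)
        self._conn.row_factory = sqlite3.Row
        # On-disk files switch to WAL; an in-memory database keeps its own mode.
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute("PRAGMA foreign_keys=ON")

    def initialize(self) -> None:
        # Idempotent DDL: safe to call on every boot.
        self._conn.execute("CREATE TABLE IF NOT EXISTS lanes (id TEXT PRIMARY KEY, label TEXT)")
        self._conn.execute(f"PRAGMA user_version={USER_VERSION}")
        self._conn.commit()

    @property
    def user_version(self) -> int:
        return self._conn.execute("PRAGMA user_version").fetchone()[0]

    @contextmanager
    def transaction(self) -> Iterator[sqlite3.Connection]:
        try:
            yield self._conn
            self._conn.commit()
        except Exception:
            self._conn.rollback()
            raise

    def count(self, table: str) -> int:
        # Validate the name against sqlite_master before interpolating it.
        known = self._conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (table,)
        ).fetchone()
        if known is None:
            return 0
        return self._conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


store = MiniStore()
store.initialize()
store.initialize()  # second call is a no-op
with store.transaction() as conn:
    conn.execute("INSERT INTO lanes VALUES ('a::nav', 'A')")
```

Validating the table name first is what lets count() interpolate it into SQL without opening an injection path.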

M2 — Row records

Stdlib dataclass records. Each record’s fields map 1:1 to the columns of its table; the importer builds the records, converts each with asdict(), and feeds the result to ArenaStore.upsert_*.

| Record | Table | Key columns |
| --- | --- | --- |
| LaneRecord | lanes | id (composite of {manifest_slug}::{variant} for quant/lora, {slug}::nav for harness/skill/bench/notebook, {label}::brain-bakeoff for the seeded brain lanes) |
| BenchResultRow | bench_results | (bench_slug, variant_label) |
| ArticleIndexRow | article_index | slug |
| HfMetaRow | hf_meta | repo_id |
| NotebookExportRow | notebook_export | file_path |
| LeaderboardRow | leaderboard_rows | (bench_id, lane_id) |

The records ChatTurnRecord / CompareRunRecord / RubricScoreRecord / HumanPrefRecord are declared at M2 (for the M3+ import path) but are not exported via __all__ until their milestones land.
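The record-to-row path (dataclass, asdict(), INSERT OR REPLACE on the key columns) can be sketched as follows. BenchRow and its score field are hypothetical stand-ins for BenchResultRow, and the generic upsert helper is an illustration rather than the ArenaStore.upsert_* code.

```python
import sqlite3
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Mapping, Union


@dataclass
class BenchRow:
    """Hypothetical stand-in for BenchResultRow."""
    bench_slug: str
    variant_label: str
    score: float  # assumed column, for illustration only


def upsert(conn: sqlite3.Connection, table: str, row: Union[Mapping[str, Any], Any]) -> None:
    # Accept either a dataclass record or a plain mapping.
    data = asdict(row) if is_dataclass(row) else dict(row)
    columns = ", ".join(data)
    placeholders = ", ".join(f":{name}" for name in data)
    # Table and column names come from trusted code, never from user input.
    conn.execute(f"INSERT OR REPLACE INTO {table} ({columns}) VALUES ({placeholders})", data)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bench_results (bench_slug TEXT, variant_label TEXT, score REAL,"
    " PRIMARY KEY (bench_slug, variant_label))"
)
upsert(conn, "bench_results", BenchRow("mmlu", "q4", 0.61))
upsert(conn, "bench_results", BenchRow("mmlu", "q4", 0.64))  # same key: replaces
upsert(conn, "bench_results", {"bench_slug": "mmlu", "variant_label": "q8", "score": 0.66})
rows = conn.execute(
    "SELECT bench_slug, variant_label, score FROM bench_results ORDER BY variant_label"
).fetchall()
```

Because the composite key decides whether a row is replaced, re-running the same import leaves the row count unchanged.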

M2 — import_artifacts(repo_root, db_path, dry_run, refresh_hf, write_mirror, hf_cache_dir)

The deterministic-Python spine of fieldkit arena import. Walks the repo’s src/content/artifacts/, articles/*/, notebooks/*/exports/**, ~/.hermes/config.yaml, and (optionally) the HuggingFace API into ~/.fieldkit/arena.db, then writes src/data/arena-mirror/leaderboard.json so the cockpit landing (M3) ships non-empty.

| Kwarg | Default | What it does |
| --- | --- | --- |
| repo_root | the checkout this fieldkit ships in | Walk a different repo (mainly for tests). |
| db_path | ~/.fieldkit/arena.db | SQLite to populate. Ignored when dry_run=True (:memory: is used). |
| dry_run | False | Plan-only mode: in-memory SQLite, no on-disk writes, the report’s row counts reflect what would have landed. |
| refresh_hf | False | Hit the HF API once per Orionfold/ repo + write a 24h cache to ~/.fieldkit/arena_cache/hf/. Default False keeps the importer offline-safe. |
| write_mirror | True | Write src/data/arena-mirror/leaderboard.json from the seeded leaderboard rows. Set False in test runs. |
| hf_cache_dir | ~/.fieldkit/arena_cache/hf | Override the HF cache root (mainly for tests). |

Returns an ImportReport with the post-upsert row counts + a warnings list (every malformed-manifest or unknown-bench-shape goes here rather than raising). The report’s counts are post-upsert totals, NOT rows written this run — so a re-run with identical inputs returns the same numbers (the M2 idempotency gate).

from fieldkit.arena import import_artifacts

report = import_artifacts(dry_run=True)
print(report.summary_line())
# → lanes=40 bench_results=17 article_index=55 hf_meta=13 notebook_export=54 leaderboard_rows=3

M2 — runnable script form

The Typer CLI (fieldkit arena import …) and a python -m shim share one code path:

# Plan-only — prints row counts without writing
python -m fieldkit.arena.scripts.import_existing --dry-run

# Real run + refresh HF metadata over the wire (writes to ~/.fieldkit/arena_cache/hf/)
python -m fieldkit.arena.scripts.import_existing --refresh-hf

# Identical via the Typer CLI
fieldkit arena import --dry-run
fieldkit arena import --refresh-hf

Errors

| Exception | Raised when |
| --- | --- |
| ArenaError | Base for every error the module raises; catch this to catch them all. |
| LaneNotRegistered | A lane lookup hits an id not in the lanes table (M2+ surface). Defined at M1 so callers can except it without waiting on store.py. Subclass of ArenaError. |
| PublishableSliceEmpty | export_publishable_slice produced a zero-row leaderboard JSON (M6+ surface), a guard against accidentally blanking the public mirror. Operator opt-out via --allow-empty. Subclass of ArenaError. |
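The hierarchy and the empty-slice guard can be illustrated with local stand-ins. The classes below only mirror the names in the table so the sketch runs without fieldkit installed; check_slice and its allow_empty parameter are hypothetical, loosely modelled on the --allow-empty opt-out.

```python
from typing import List, Mapping


class ArenaError(Exception):
    """Local stand-in for the module's base error."""


class PublishableSliceEmpty(ArenaError):
    """Local stand-in: the export produced zero leaderboard rows."""


def check_slice(rows: List[Mapping], allow_empty: bool = False) -> List[Mapping]:
    # Refuse to publish an empty leaderboard unless the operator opted out.
    if not rows and not allow_empty:
        raise PublishableSliceEmpty("leaderboard slice has zero rows")
    return rows


try:
    check_slice([])
    caught = None
except ArenaError as exc:  # catching the base class catches every subclass
    caught = exc
```

A caller that only cares about "something in Arena went wrong" can catch the base class, while the CLI can single out the empty-slice case to print a more specific hint.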

Constants

| Name | Value | Why |
| --- | --- | --- |
| ARENA_SURFACE_VERSION | "0.1.0a0" (M1) → "0.1.0" at M7 | Independent of fieldkit.__version__ so a downstream tool can gate on the surface (arena ≥ 0.1.0) without pinning the whole package. |
| DEFAULT_ARENA_PORT | 7866 | Spec §3.4: mnemonic Spark+Arena reads; free across the existing port map (llama-server :8080, NIM :8000, pgvector :5432, Astro dev :4321, Redis :6379). |
| DEFAULT_ARENA_DB | "~/.fieldkit/arena.db" | Operator-private SQLite, NOT in repo (gitignored). Created lazily on first sidecar boot at M2. |
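Gating on the surface version means an alpha string such as "0.1.0a0" must sort before "0.1.0". A downstream tool could do that with a small stdlib parser like the sketch below; surface_key and surface_at_least are hypothetical helpers that handle only the X.Y.Z plus optional a/b/rc suffix shape. Where the third-party packaging library is available, packaging.version.Version is the more robust choice.

```python
import re
from typing import Tuple

_PRE_RANK = {"a": 0, "b": 1, "rc": 2}


def surface_key(version: str) -> Tuple[int, ...]:
    """Turn 'X.Y.Z' or 'X.Y.Z<a|b|rc>N' into a sortable tuple."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:(a|b|rc)(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    if match.group(4) is None:
        pre = (3, 0)  # a final release sorts after every pre-release
    else:
        pre = (_PRE_RANK[match.group(4)], int(match.group(5)))
    return (major, minor, patch, *pre)


def surface_at_least(current: str, minimum: str) -> bool:
    return surface_key(current) >= surface_key(minimum)


m1_ok = surface_at_least("0.1.0a0", "0.1.0")
m7_ok = surface_at_least("0.1.0", "0.1.0")
```

Under this ordering the M1 alpha does not satisfy an "arena ≥ 0.1.0" gate, while the M7 value does.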

The arena extra

The sidecar deps ship via an optional install:

pip install 'fieldkit[arena]'

| Dep | Why |
| --- | --- |
| fastapi>=0.115 | sidecar HTTP framework: async-native, pydantic schemas, lifespan, SSE-friendly |
| uvicorn[standard]>=0.30 | ASGI server (M3 launcher) |
| sse-starlette>=2.1 | SSE helpers (telemetry pump, chat stream adapter, compare event sequence) |
| aiosqlite>=0.20 | async access to ~/.fieldkit/arena.db |
| arq>=0.26 | v0.2 job queue (eval-runner pane); installed at M1 for forward compat, runtime-optional in v0.1 |
| redis>=5.0 | arq broker (Redis already on the box for pgvector); runtime-optional in v0.1 |
| huggingface_hub>=0.24 | M2 retroactive import reads HF repo metadata for the 13 Orionfold/ repos |
| pyyaml>=6.0 | reads src/content/artifacts/*.yaml manifests + ~/.hermes/config.yaml |

import fieldkit.arena is stdlib-only — none of these are loaded on package import. FastAPI ships behind create_app() (M3); aiosqlite behind ArenaStore.initialize() (M2); huggingface_hub behind the M2 import script.
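Keeping the package import stdlib-only usually comes down to importing the heavy dependency inside the function that needs it. The sketch below shows that general pattern; require and create_app_sketch are hypothetical names, and this is not the code behind create_app().

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str = "arena") -> ModuleType:
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install 'fieldkit[{extra}]'"
        ) from exc


def create_app_sketch():
    # FastAPI is only imported when the sidecar is actually built,
    # so importing the surrounding module stays cheap.
    fastapi = require("fastapi")
    return fastapi.FastAPI()
```

Defining create_app_sketch costs nothing at import time; the RuntimeError only surfaces if someone calls it on a machine without the extra installed.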

CLI surface (locked at M1, bodies fill across the arc)

$ fieldkit arena --help
Usage: fieldkit arena [OPTIONS] COMMAND [ARGS]...

  Operator cockpit for the DGX Spark (M1 stub; M3 fills the sidecar).

Commands:
  serve                Launch the FastAPI cockpit sidecar (M3).
  import               Retroactive load: manifests + articles + benches + HF (M2).
  mirror               Export leak-proof publishable slice (M6, this release).
  rebuild-leaderboard  Recompute denormalized `leaderboard_rows` (M6).
  memcheck             Print unified-memory envelope + warm-lane footprint (stub).
  promote-run          Mark a `compare_run` as publishable + supply redaction (stub).

serve / import / mirror / rebuild-leaderboard are live as of M6; memcheck and promote-run ship at M7. The CLI shape was frozen at M1 so the bodies can fill across the arc without drifting the operator-visible contract.

Milestone roadmap (forward-looking; not API)

| Milestone | Surface |
| --- | --- |
| M1 | scaffold: __init__.py + schemas.py + server.py skeleton + cli.py stubs + the arena extra |
| M2 | scripts/import_existing.py retroactive load + SQLite schema (spec §4.8); ArenaStore materializes |
| M3 | create_app() + lifespan + GET /api/telemetry/stream SSE backed by fieldkit.harness.Telemetry + GET /api/lanes + GET /api/leaderboard + <TelemetryGauge> Preact island on /arena/ |
| M4 | <ChatLane> + POST /api/chat/stream SSE proxy to llama-server :8080 |
| M5 | <CompareDuel> + POST /api/compare/stream + deterministic rubric scoring via fieldkit.eval.score_answer |
| M6 (this release) | mirror.py with hardcoded allowlist guard (chat_* tables NEVER enumerated); /arena/leaderboard/ Astro page; regression test asserts zero leaks against random-UUID sentinels |
| M7 | fieldkit v0.14.0 cut + articles/introducing-spark-arena-on-spark/ + Mac /sync-field-notes push |

Each milestone is independently shippable on origin/main; the M2/M5/M6 risky milestones each have an explicit “fresh-session hint: YES” marker in HANDOFF.md’s ARENA TRACK section.

Cross-module reuse (M3 onward)

Arena calls into — never duplicates — these symbols:

| Symbol | Source | Used for |
| --- | --- | --- |
| serve_lane, LaneSpec, NIMLane, LlamaServerLane, VLLMLane, OllamaLane | fieldkit.harness | every lane swap routes through serve_lane(guard=True, headroom_gb=8.0), the only enforcer of the single-brain envelope |
| Telemetry, measure_throughput | fieldkit.harness | the GPU% / unified-mem / temp / tok/s / ttft sampler |
| CostRouterConfig, RouteTier, build_cost_router | fieldkit.harness | powers the default Compare B-lane (OpenRouter via tier predicates) |
| RouterConfig, build_vertical_router | fieldkit.harness | H5 vertical router, surfaced as the v0.2 “route this compare per-vertical” mode |
| score_answer, Rubric, CheckSpec, GradedPromptSuite, load_rubric, HEDGE_PHRASES | fieldkit.eval | every rubric score event; the deterministic scoring axis |
| OpenAICompatClient, split_think, stream_reply, discover_local_server | fieldkit.notebook | chat client forwarding; <think> prefix split per feedback_nim_think_prefix_convention |
| ArtifactManifest, ArtifactKind, ORIONFOLD_HF_HANDLE | fieldkit.publish | read-only; drives artifact browser; v0.2 HF dataset push reuses HFHubAdapter |
| Capabilities.load(), weight_bytes, kv_cache_bytes | fieldkit.capabilities | the serve_lane guard’s memory math |

Per feedback_keep_scorer_local_until_reuse, ad-hoc rubrics live at ~/.fieldkit/arena/rubrics/ until a 2nd reuse triggers promotion to fieldkit/src/fieldkit/eval/rubrics/.

See also

  • _SPECS/spark-arena-v1.md — the locked v1.0 spec; section numbers referenced throughout this page.
  • HANDOFF.md 🏟️ ARENA TRACK section — the session-by-session milestone breakdown.
  • ideas/spark-arena.md — the living-doc tracking of gate decisions + execution updates (per feedback_ideas_docs_living).
  • fieldkit.harness API page — the sibling content line’s module reference.

Articles that use fieldkit.arena