What it is
The engine in pane → hands → engine — the trainer half of
_SPECS/rlvr-loop-v1.md (fieldkit.reward is the scorer half). It closes the
loop the first four roadmap milestones were built to land: eval → reward →
fine-tune → re-eval, with the Spark’s own fieldkit.eval verifiers as the
reward function. Arena M8 dispatches it, M11 schedules + budget-brakes it
overnight, M9 prices its RL-vs-pay ROI, M10 answers “has this been tried”
before it runs.
fieldkit.rl wraps the loop that actually ran on a single GB10 —
clawgym-on-spark-grpo’s hand-rolled ~280-LOC REINFORCE-with-KL +
kill-and-restart-vLLM, not Unsloth-GRPO or NeMo-RL (neither drove the
working run; [[project_verl_atgpo_vllm_gap]] is the cautionary precedent). The
named abstractions are a documented fallback lane, not the default (RV-1).
Orchestration ships; GPU math injects
This module owns the control flow, not the GPU kernels: split → sample a
group → reward → group-relative step → held-out gate → held-out-only checkpoint
selection → lineage card is pure, deterministic, and unit-testable. The three
GPU-touching seams are injected — the same dispatch_job(runner=…) pattern:
- a sampler (
tasks, k → list[list[Rollout]]) — the pinned-vLLM rollout (~13 min/step, the dominant cost); - a trainer (
rollouts, advantages, step → metrics) — the REINFORCE-with-KL LoRA step (~22 s) plus the kill-and-restart vLLM to reload the adapter (~3.5 min — the eliminable quarter, RV-5; hot-LoRA-swap is the tracked fast-follow, not a v1 gate); - a heldout_eval (
step, tasks → float) — the held-out gate (RV-4), in the Arena path a dispatched M8eval_rerunover the frozen split.
torch / vLLM are never imported at module load — only inside the GPU-seam
factory gpu_seams, so import fieldkit.rl stays stdlib-cheap and the
orchestration tests run with no GPU. The real backend is now vendored (the
ported clawgym-on-spark-grpo loop): gpu_seams returns the three real seams
when the fieldkit[rl] extra is installed, and raises a friendly RLLoopError
pointing at the extra when it isn’t. It splits torch-free / torch-bound so the
GPU stack is touched only on a live call:
fieldkit._rl_gpu_serve(torch-free) — the HTTP rollout sampler over the local pinned-vLLM OpenAI endpoint (viafieldkit.nim.NIMClient; the GPU lives in a separate vLLM server process), the held-out gate, and theVLLMLanekill-and-restart serve lifecycle;fieldkit._rl_gpu_trainer— the torch/peft REINFORCE-with-KL step (the K3 KL estimator against a frozen CPU-resident reference snapshot, advantage-weighted loss, AdamW, adapter save). Importing it is thefieldkit[rl]gate.
Three corrections, encoded structurally
- Pool-convergence is a trap (RV-4).
t2pohit pool 87.5% vs held-out 5.7% at step 45 — an 81.8 pp inversion. The loop evaluates a frozen held-out split every ≤heldout_everysteps and selects the published checkpoint on held-out score, never pool. - Pin vLLM (RV-5). Six API drifts across two minor versions (one silent
return-shape change).
GRPOConfig.vllm_pinrecords the pinned version. - ≥100-row corpus floor (RV-10). 42 rows mode-collapsed.
RLLoop.runrefuses a corpus belowcorpus_minand carves the held-out split before step 0, trainer-resident, one vLLM lane (the 2026-04-22 OOM landmine).
The surfaces
GRPOConfig— the hyperparameters, defaulting to the provenclawgymknobs:base,lora_rank=16,group_k=4,tasks_per_step=8,temp=0.8,heldout_every=10(the hard gate),vllm_pin,max_steps=34,corpus_min=100(the floor),heldout_frac=0.2,kl_coef,lr,seed(makes the split + per-step draw deterministic), andheldout_patience(default off — the RV-R1 inversion guard that stops the run when held-out stops improving). Validates on construction.RLLoop— the driver. Construct with aGRPOConfig, afieldkit.reward.RewardAdapter, and a bench (any object with aquestionslist —fieldkit.eval.VerticalBench); injectsampler/trainer/heldout_eval..run()runs tomax_stepsand returns the run’sfieldkit.lineage.LineageSnapshot— therl_runcard and the source of the §5 “living-model” delta chart (RV-7). After.run(),.heldout_scores,.pool_scores,.selected_step, and.selected_heldout_scoreexpose the held-out-only checkpoint pick;.summary()is the aggregate digest the Arena dispatcher persists tojobs.result_json.RLLoopError— raised on a corpus belowcorpus_min, an invalidGRPOConfig, or a.run()with a missing GPU seam.rl_hooks(progress_cb=None, should_abort=None)— a context manager that binds live-progress + OOM-abort hooks for anyRLLooprun in its scope (rl-lane-autonomy LA-8/10). The Arenarl_rundispatch enters it around the loop so an unmodifiedrun_rl_loopstill reports live step state (intojobs.result_json) and respects a watchdog abort — the hooks ride acontextvars.ContextVar, so they are thread-/task-local and flow arena → rl only (fieldkit.rlnever importsfieldkit.arena). ExplicitRLLoop(progress_cb=…, should_abort=…)injection wins; otherwise the loop falls back to the ambient hooks. A bareimport fieldkit.rlleaves them unset → zero overhead.current_rl_hooks()— the ambient(progress_cb, should_abort)for the current scope ((None, None)when norl_hooksis active).RLLoop.runreads this when its own hook fields are unset.
How Arena dispatches it
fieldkit.arena.jobs promotes the pre-drilled rl_run / requant job kinds
into DISPATCHABLE (RV-6); the M11 cron drains them single-lane overnight behind
the budget governor. The run is async/overnight only — the 8.5 h GRPO loop
can’t be a synchronous cockpit click, so the server.py POST /api/jobs
allowlist stays narrow and rl_run reaches the dispatcher via enqueue_job. On
a held-out win, an optional requant re-quantizes the lifted checkpoint and the
loop closes visibly on the leaderboard. No new arena.db table, no
user_version bump (RV-8): the trajectory rides fieldkit.lineage, the
held-out scores ride eval_runs.
Operator run (the GPU path is armed, not automatic)
A live run is a deliberate operator action — the orchestration is GPU-free, but the rollout/train loop needs the GPU stack on the box:
-
Install the extra —
pip install "fieldkit[rl]"(torch + peft + transformers + protobuf + sentencepiece + safetensors + accelerate). These are pip-installable on the GB10.The proven GB10 stack (BUG-4, e2e smoke 2026-06-06). The extra is ceiling-pinned to the combo a real 4-step GRPO
rl_runvalidated:transformers==4.51.3+protobuf+sentencepiece+torch 2.12.0+cu130peft 0.19.1. Do not “upgrade past” it casually — the drift chain it pins out: transformers 5.x crashesdevice_map="auto"on GB10 (Cannot copy out of meta tensor); 4.5x without protobuf/sentencepiece fails the qwen3 tokenizer load; and no 4.x accepts a NeMo-exportedtokenizer_config.jsoncarrying list-formextra_special_tokens.
The shadow model dir (the NeMo-export fix). A NeMo-merged HF export may write
extra_special_tokensas a list (and may be root-owned, so unfixable in place). Build a shadow dir: symlink every file from the export, replace onlytokenizer_config.jsonwith a copy whoseextra_special_tokenskey is dropped, then pointFK_RL_BASE_MODELat the shadow. vLLM keeps serving the original (identical weights); only the trainer-side tokenizer load needs the fixed config. On the Spark:merged-hf-bf16-fixed/shadowsmerged-hf-bf16/.Venue: keep the RL venv OFF
/tmp(reboot-ephemeral — every reboot silently re-resolves the stack; that is exactly how BUG-4 shipped). -
Serve a pinned vLLM separately. vLLM is not a dependency of the extra: there is no aarch64+CUDA-13 wheel for the pinned version yet (
[[project_verl_atgpo_vllm_gap]]), so it is brought up as its own process (or container) serving the base model with--enable-lora.VLLMLanethen kill-and-restarts it between steps to load each lifted adapter. -
Point the seams at the box via environment (the seam signature
gpu_seams(config)stays stable — the operator tunes the box, not the API):Var Meaning Default FK_RL_VLLM_URLlocal vLLM OpenAI base http://localhost:8000/v1FK_RL_BASE_MODELHF base id to serve + train Qwen/Qwen2.5-7B-InstructFK_RL_ADAPTER_INITSFT-init LoRA the run starts from (required) FK_RL_WORK_DIRwhere per-step adapters are written ~/.fieldkit/rlFK_RL_LORA_NAMEserved LoRA module == chat modelpolicyFK_RL_GPU_UTILvLLM --gpu-memory-utilization(one-lane, RV-10)0.55FK_RL_HELDOUT_TEMPheld-out-eval temperature (< train temp) 0.2FK_RL_SERVE_CMD/FK_RL_STOP_CMDfull serve/stop overrides (e.g. docker exec)host vllm/ EngineCore-awarepkill -
Dispatch overnight, never synchronously (RV-6) —
rl_runenqueues into the M8 dispatcher and drains under the M11 single-lane cron behind the budget governor. The ~8.5 h loop is not a cockpit click.
Until those are in place, inject your own sampler / trainer / heldout_eval
into RLLoop directly — the orchestration is fully functional and tested with
fakes.
Operator: full autonomy (rl-lane-autonomy v1)
Once a pinned vLLM is installed, the hand-driven overnight session becomes one
armed policy. fieldkit.arena.lane (LA-1..11) wraps the dispatchable rl_run
so it is self-driving, observable, and self-defending:
- Arm the cron once —
fieldkit arena autonomy on --interval-min 30 --cap-usd 5records the policy + prints the crontab line that runsfieldkit arena drainon a cadence (--install-cronwrites it for you).fieldkit arena autonomy statusreports it;fieldkit arena autonomy offis the one-command reversal. - Each drain tick acquires the single-drain lock and, for an
rl_run, enters theLaneArbiter: a 3-way pre-flight (governor allow ∧ envelope fits ∧ a vLLM binary present — else a cleandeferwithLANE_BIN_ABSENT, never a crash), then it frees the resident chat brain (FK_RL_RESIDENT_STOP_CMD/FK_RL_RESIDENT_START_CMD), runs under aMemoryWatchdog, and always restores the prior lane on exit. - It reports live — the loop writes throttled
{step, phase, pool_score, last_heldout, eta_s, mem}intojobs.result_json(LA-8); the cockpit Jobs board renders an inline progress strip with the pool-vs-held-out read, so the t2po inversion (RV-4) is visible as it happens. - It defends the box — the
MemoryWatchdogenforces a unified-memory headroom floor (FK_RL_OOM_FLOOR_GB, default 4; warn atFK_RL_OOM_WARN_GB, default 8). A breach that persists ~2 s touches an abort sentinel the loop polls between steps → the run tears down before the kernel OOM-kills ([[project_spark_unified_memory_oom]]). Every run attaches amem_trace(peak / headroom / per-phase) to its lineage + the morning standup (“RAN 1 · peak 119 GB · 1 OOM-deferred”).
The autonomy gate stays operator-armed — one autonomy on, not per run —
and the $/day governor cap + the single-lane lock still bound every tick.
What it is not
It does not re-implement a GPU trainer (the proven loop is injected via the
seams / vendored later), publish a kind: rl_run artifact (RV-9 defers
publishable kinds to second-vertical reuse), or churn the schema. v1 ships the
loop, not the storefront.